ELEMENTS
OF NUMERICAL
ANALYSIS


PETER HENRICI


Professor of Mathematics 
Eidgenössische Technische Hochschule
Zürich, Switzerland


John Wiley & Sons, Inc. New York · London · Sydney


SECOND PRINTING, SEPTEMBER, 1965 


Copyright © 1964 by John Wiley & Sons, Inc. 


All Rights Reserved. This book or any part thereof must not be reproduced 
in any form without the written permission of the publisher. 


Library of Congress Catalog Card Number: 64-23840 
PRINTED IN THE UNITED STATES OF AMERICA 


to George Elmer Forsythe 




PREFACE 


The present book originated in lecture notes for a course entitled “Numerical
Mathematical Analysis” which I have taught repeatedly at the
University of California, Los Angeles, both in regular session and in a 
Summer Institute for Numerical Analysis, sponsored regularly by the 
National Science Foundation, whose participants are selected college 
teachers expecting to teach similar courses at their own institutions. 
The prerequisites for the course are 12 units of analytic geometry and 
calculus plus 3 units of differential equations. 

The teaching of numerical analysis in a mathematics department poses 
a peculiar problem. At a time when the prime objectives in the instruc- 
tion of most mathematical disciplines are rigor and logical coherence, 
many otherwise excellent textbooks in numerical analysis still convey 
the impression that computation is an art rather than a science, and that 
every numerical problem requires its own trick for its successful solution. 
It is thus understandable that many analysts are reluctant to take much 
interest in the teaching of numerical mathematics. As a consequence 
these courses are frequently taught by instructors who are not primarily 
concerned with the mathematical aspects of the subject. Thus little is 
done to whet the computational appetite of those students who feel 
attracted to mathematics for the sake of its rich logical structure and 
clarity, and our schools fail to turn out computer-trained mathematicians 
in the large numbers demanded by modern science and technology. 

Contrary to the view of computation as an art, I have always taken 
the attitude that numerical analysis is primarily a mathematical discipline. 
Thus I have tried to stress unifying principles rather than tricks, and to
establish connections with other branches of mathematical analysis. Indeed,
if looked at in this manner, numerical analysis provides the instructor
with a wonderful opportunity to strengthen the student’s grasp
of some of the basic notions of analysis, such as the idea of a sequence, 
of a limit, of a recurrence relation, or of the concept of a definite integral. 
To achieve a balance between practical and theoretical content, I have 
made — for the first time in a numerical analysis text — a clearcut dis- 
tinction between algorithms and theorems. An algorithm is a computa- 
tional procedure; a theorem is a statement about what an algorithm does. 

In addition to standard material the book features a number of modern 
algorithms (and corresponding theorems) that are not yet found in most 
textbooks of numerical analysis, such as: the quotient-difference al- 
gorithm, Muller’s method, Romberg integration, extrapolation to the 
limit, sign wave analysis, computation of logarithms by differentiation, 
and Steffensen iteration for systems of equations. In the last chapter I 
have given an elementary theory of error propagation that is sufficiently 
general to cover many algorithms of practical interest. 

Another novelty for a textbook in numerical analysis is my attempt 
to treat the theory of difference equations with the same amount of rigor 
and generality that is usually given to the theory of differential equations. 
Thus this important research tool becomes available for systematic use 
in a number of contexts. In fact, difference equations form one of the 
unifying themes of the book. 

On the basis of my teaching experience I have found it necessary to 
include a rather extensive chapter on complex numbers and polynomials. 
The modern trend of replacing the course in classical theory of equations 
by a course in linear algebra has had the rather curious consequence 
that, at the level for which this book is intended, students now are less 
familiar with the basic properties of the complex field than they used to be. 

The book contains about 300 problems of varying computational and 
analytical difficulty. (Some of the more demanding problems are marked 
by an asterisk.) In addition, a small number of research problems have 
been stated at the end of some of the chapters. Some of these problems 
are in the form of non-trivial theoretical assignments requiring library 
work. The others pose practical questions of some general interest and 
are intended to stimulate undergraduate research participation. Their 
solution usually presents a non-trivial challenge in experimental compu- 
tation. 

I have omitted all references to numerical methods in algebra and 
matrix theory, because I feel that this topic is best dealt with in a 
separate course. As its theoretical foundations are quite different, to
treat it with the same attitude that I have tried to adopt towards numerical
analysis would roughly have doubled the size of the book.

For a similar reason I have omitted material on programming and 
programming languages. It goes without saying, however, that no cur- 
riculum in numerical mathematics can be complete without permitting 
the student to acquire some experience in actual computation. At the 
Institute for Numerical Analysis, this experience is acquired in a simul- 
taneous three unit programming course (one unit lectures, two units 
laboratory). The subject matter in this book is deliberately arranged 
so that the easy-to-program algorithms occur early, and a course based 
on it can easily be synchronized with a programming course. 

A preliminary version of the book has been used on a trial basis at a 
number of schools, and I have been fortunate to receive a number of 
comments and constructive criticisms. In particular I wish to express 
my gratitude to Christian Andersen, P. J. Eberlein, Gene H. Golub, M. 
Melkanoff, Duane Pyle, T. N. Robertson, Sydney Spital, J. F. Traub, 
and Carroll Webber for suggestions which I have been able to incorporate 
in the final text. I also wish to thank Thomas Bray and Gordon Thomas 
of the Boeing Scientific Research Laboratories for assistance in planning 
some of the machine-computed examples. Finally I record my debt to 
my wife, Eleonore Henrici, for her unflinching help, far beyond the call 
of conjugal duty, in preparing both the preliminary and the final version 
of the manuscript. 

I dedicate this book to my mentor and former colleague, George E.
Forsythe, who had a decisive influence on my view of the whole area of 
mathematical computation. 


Zurich, Switzerland P. HENRICI 
June 1964


chapter 1 what is numerical analysis?


1.1 Attempt of a Definition 


Unlike other terms denoting mathematical disciplines, such as calculus 
or linear algebra, the exact extent of the discipline called numerical 
analysis is not yet clearly delineated. Indeed, as recently as twenty years 
ago this term was still practically unknown. It did not become generally
used until the Institute for Numerical Analysis was founded at the University
of California in Los Angeles in 1947. But even today there are widely
diverging views of the subject. On the one hand, numerical analysis is 
associated with all those activities which are more commonly known as 
data processing. These activities comprise—to quote only some of the 
more spectacular examples—such things as the automatic reservation of 
airline seats, the automatic printing of paychecks and telephone bills, the
instantaneous computation of stock market averages, and the evaluation 
and interpretation of certain medical records such as electroencephalograms.
On the other hand, the words “numerical analysis” have
connotations of endless arithmetical drudgery performed by mathematical 
clerks armed with a pencil, a tremendous sheet of paper, and the 
indispensable eraser. 

Between these extreme views we wish to steer a middle course. As far 
as this volume is concerned, we shall mean by numerical analysis the 
theory of constructive methods in mathematical analysis. The emphasis is 
on the word “constructive.” By a constructive method we mean a 
procedure that permits us to obtain the solution of a mathematical 
problem with an arbitrary precision in a finite number of steps that can be 
performed rationally. (The number of steps may depend on the desired 
accuracy.) Thus, the mere proof that nonexistence of the solution would 
lead to a logical contradiction does not represent a constructive method. 

A constructive method usually consists of a set of directions for the 
performance of certain arithmetical or logical operations in predetermined
order, much as a recipe directs the housewife to perform certain chemical
operations. As with a good cookbook, this set of directions must be 
complete and unambiguous. A set of directions to perform mathe- 
matical operations designed to lead to the solution of a given problem is 
called an algorithm. The word algorithm was originally used primarily to 
denote procedures that terminate after a finite number of steps. Finite 
algorithms are suitable mainly for the solution of problems in algebra. 
The student is likely to be familiar with the following two examples: 


1. The Euclidean algorithm for finding the greatest common divisor of 
two positive integers. 

2. The Gaussian algorithm for solving a system of n linear equations with 
n unknowns. 
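
As an illustration of a finite algorithm, the first of these examples can be sketched in a few lines of Python. This is only one possible formulation; the function name `gcd` and the error handling are choices made here, not part of the text.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor of two positive integers (Euclidean algorithm)."""
    if a <= 0 or b <= 0:
        raise ValueError("both arguments must be positive integers")
    # Replace (a, b) by (b, a mod b) until the remainder vanishes.
    # The remainder strictly decreases, so the loop terminates.
    while b != 0:
        a, b = b, a % b
    return a


print(gcd(1071, 462))  # 21
```

The loop stops after finitely many passes because the second argument decreases at every step and cannot become negative.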


The problems occurring in analysis, however, usually cannot be solved 
in a finite number of steps. Unlike the recipes of a cookbook, the 
algorithms designed for the solution of problems in mathematical analysis 
thus necessarily consist of an infinite (although denumerable) sequence of 
operations. Only a finite number of these can be performed, of course, in 
any practical application. The idea is, however, that the accuracy of the 
answer increases with the number of steps performed. If a sufficient 
number of steps are performed, the accuracy becomes arbitrarily high. 
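
The idea of an infinite algorithm cut off after finitely many steps can be made concrete with a small sketch. The example chosen here (the series 1/0! + 1/1! + 1/2! + ... for the number e) is an assumption for illustration only and does not come from the text; the name `e_partial_sum` is likewise hypothetical.

```python
from fractions import Fraction
import math


def e_partial_sum(n_terms: int) -> Fraction:
    """Sum of the first n_terms terms of 1/0! + 1/1! + 1/2! + ..."""
    total = Fraction(0)
    term = Fraction(1)              # 1/0!
    for k in range(n_terms):
        total += term
        term /= (k + 1)             # next term: 1/(k+1)!
    return total


# The error shrinks as more steps of the (never-ending) process are performed.
for n in (2, 5, 10, 15):
    approx = e_partial_sum(n)
    print(n, float(approx), abs(math.e - float(approx)))
```

Each partial sum is computed exactly as a rational number; only the comparison with `math.e` uses floating-point arithmetic.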


Problems 


1. Formulate a (finite) algorithm for deciding whether a given positive integer 
is a prime. 

2. Formulate an algorithm for computing $\sqrt{2}$ to an arbitrary number of
decimal places. (The algorithm is necessarily infinite. Why?) 


1.2 A Glance at Mathematical History 


Confronted with the definition of a constructive method given above, an 
unspoiled student is likely to ask: Is not all mathematics constructive in 
the indicated sense? 

There indeed was a time when most of the work done in mathematics 
was not only inspired by concrete questions and problems but also aimed 
directly at solving these problems in a constructive manner. This was the 
period of those classical triumphs of mathematics that fill the layman with 
awe to the present day: The prediction of celestial phenomena such as 
eclipses of the sun or the moon, or the accurate prophecy of the appearance 
of a comet. Those predictions were possible, because the solutions of the
underlying mathematical problem were not merely shown to exist, but 
were actually found by constructive methods.

The high point of this classical algorithmic age was perhaps reached in
the work of Leonhard Euler (born 1707 in Basel, died 1783 in St. Petersburg).
Euler was possessed of a faith in the all-embracing power of
mathematics which today appears almost naive. At the age of twenty, 
before he had ever seen the ocean, he won a competition of the Paris 
Academy of Sciences with an essay on the best way to distribute masts on 
sailing vessels. Innumerable numerical examples are dispersed in the 
(so far) seventy volumes of his collected works, showing that Euler always 
kept foremost in his mind the immediate numerical use of his formulas 
and algorithms. His infinite algorithms frequently appear in the form of 
series expansions. 

After Euler’s time, however, the faith in the numerical usefulness of an 
algorithm appears to have decreased slowly but steadily. While the 
problems subjected to mathematical investigation increased in scope and 
generality, mathematicians became interested in questions of the existence 
of their solutions rather than in their construction. It is true that up to 
1900 most existence theorems were proved by methods which we would 
call constructive; however, the computational demands made by these 
methods were such as to render absurd the idea of actually carrying 
through the construction. It is hardly conceivable, for instance, that 
Emile Picard (1856-1941) ever thought of going through the motions 
required by his iteration method for solving, say, a nonlinear partial 
differential equation. 

In view of the feeling of algorithmic impotence which must have per- 
vaded the mathematical climate at that time, it is easily understandable 
that mathematicians were increasingly inclined to use purely logical rather 
than constructive methods in their proofs. An early significant instance 
of this is the Bolzano-Weierstrass theorem (see Buck [1959], p. 10) where 
we are required infinitely many times to decide whether a given set is 
finite or not. To make this decision even once is, in general, not possible 
by any constructive method. During the second half of the 19th century 
the nonconstructive, purely logical trend in mathematics was rapidly 
picking up momentum. Some main stations of this development are 
marked by the names of Dedekind (1831-1916), Cantor (1845-1918), 
Zermelo (1871-1953). In spite of some countertrends inspired by out- 
standing mathematicians such as Hermann Weyl (1885-1955), the logical 
point of view appeared to be steering towards an almost absolute victory 
near the end of the first half of the 20th century. Mathematics finally 
seemed at the threshold of making true the proud statement of Jacobi 
(1804-1851) that: “Mathematics serves but the honor of the human spirit.”

By a strange coincidence, algorithmic mathematics has been liberated 
from the vincula of numerical drudgery at the very moment when pure
mathematics finally seems to have liberated itself from the last ties of
algorithmic thought. From the beginnings of the art of computation to 
the early 1940’s the speed of computation had been increased by a factor 
of about ten, due to the invention of various computing devices which 
today seem primitive. Since the early 1940’s the speed of computation 
has increased a millionfold due to the invention of the electronic digital 
computer. To put into effect even the most complicated algorithms 
presents no difficulty today. As a consequence, the demand for algorithms
of all kinds has increased enormously. 


Problem 


3. Look up Dedekind’s axiom concerning the so-called Dedekind sections 
(see Taylor [1959], p. 447) and give three examples of its application in 
elementary calculus. 


1.3 Polynomial Equations: An Illustration


To illustrate the types of problems and concepts we wish to deal with, 
we will consider the problem of solving polynomial equations. We are all 
familiar with the quadratic equation 


(1-1) $x^2 + a_1 x + a_2 = 0$,


where $a_1$ and $a_2$ denote arbitrary real numbers. It is well known that the
solutions of this equation are given by the formula


(1-2) $x = -\frac{1}{2} a_1 \pm \left[\left(\frac{1}{2} a_1\right)^2 - a_2\right]^{1/2}$;


in certain cases the computation of the square root leads to complex 
numbers (see chapter 2). Formula (1-2) states more than the mere 
existence of a solution of equation (1-1); in fact it indicates an algorithm 
which permits us to calculate the solution. More precisely, the problem 
of computing the solution is reduced to the simpler problem of computing 
a square root. If an instrument for computing roots is available (this 
could be a table of logarithms, a computing machine especially pro- 
grammed for the purpose, or the reader of this book armed with the 
algorithm he developed when solving problem 2), then any quadratic 
equation can be solved. Is the same also true for equations of higher 
degree? 
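
A minimal sketch of formula (1-2), assuming Python's standard `cmath.sqrt` plays the role of the "instrument for computing roots". The function name `solve_monic_quadratic` is hypothetical.

```python
import cmath


def solve_monic_quadratic(a1: float, a2: float) -> tuple[complex, complex]:
    """Both solutions of x**2 + a1*x + a2 = 0 via formula (1-2)."""
    half = a1 / 2
    # cmath.sqrt also accepts negative arguments and then returns a complex root.
    root = cmath.sqrt(half * half - a2)
    return (-half + root, -half - root)


print(solve_monic_quadratic(-3.0, 2.0))   # real solutions 2 and 1
print(solve_monic_quadratic(6.0, 25.0))   # complex pair -3 + 4i, -3 - 4i
```

The whole problem is reduced to a single call of the root routine, which is exactly the reduction described above.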
Let there be given an arbitrary polynomial of degree n, 


(1-3) $p(x) = x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_n$;


the problem is to find the solutions of the equation $p(x) = 0$. In the
golden age of algorithmic mathematics Scipione dal Ferro (1496-1525)
and Rafaello Bombelli (L’Algebra, 1579) discovered that in the cases
$n = 3$ and $n = 4$ there exist algorithms for finding all solutions of equation
(1-3) that merely require an instrument for computing roots. Through 
several centuries attempts were made to solve equations of higher than the 
fourth degree in a similar manner, but all these efforts remained fruitless. 
Finally Galois (1811-1832) proved, in a paper written on the eve of his 
premature death in a duel, that it is not possible in the case n > 4 to 
compute the solutions of equation (1-3) with an instrument that merely 
calculates roots. This, then, is a typical instance of nonexistence of a 
certain type of algorithm. Modern numerical analysis, too, knows of 
such instances of nonexistence of algorithms. 

If a problem cannot be solved with an algorithm of a certain type, this 
does not mean that the problem cannot be solved at all. The problem is 
merely that of discovering a new algorithm. Today there are no practical 
limits to mathematical inventiveness in the discovery of new algorithms. 
It must be admitted, though, that the talent for discovering algorithms is 
in no way confined to mathematicians. Some of the most effective 
algorithms of numerical analysis have been discovered by aerodynamicists 
such as L. Bairstow (1880- ); by astronomers such as P. L. Seidel 
(1821-1896); by the meteorologist L. F. Richardson (1881-1953); and by 
the statistician A. C. Aitken (1895- ).

The problem of finding the solutions of a polynomial equation of 
arbitrary degree has attracted mathematicians of many generations such 
as Newton (1643-1727), Bernoulli (1700-1782), Fourier (1768-1830), 
Laguerre (1834-1886), and even today significant contributions to the 
problem are made almost every year. One of the simplest algorithms for 
solving equation (1-3) is due to Daniel Bernoulli. If a sequence of 
numbers $z_1, z_2, z_3, \ldots$ is determined by setting $z_k = 0$ for $k < 0$, $z_0 = 1$,
and calculating $z_k$ for $k > 0$ by means of the recurrence relation


(1-4) $z_k = -a_1 z_{k-1} - a_2 z_{k-2} - \cdots - a_n z_{k-n}$ $(k = 1, 2, \ldots)$,


then the sequence of quotients 


(1-5) $q_k = \frac{z_{k+1}}{z_k}$


tends, if certain hypotheses are fulfilled, to a solution of equation (1-3)
(see chapter 7). Here we have a typical example of an infinite algorithm,
for the sequence $\{z_k\}$ never terminates. The recursive nature of Bernoulli’s
process (the same formula (1-4) is evaluated over and over again) also is
typical of many processes of modern numerical analysis.
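
Bernoulli's process as described by (1-4) and (1-5) can be sketched as follows. The function name `bernoulli_quotients`, the use of exact rational arithmetic, and the test polynomial $x^2 - 4x + 3$ (zeros 3 and 1) are assumptions made here for illustration.

```python
from fractions import Fraction


def bernoulli_quotients(coeffs, steps):
    """Quotients q_k = z_{k+1}/z_k for k = 0, ..., steps-1, where
    p(x) = x**n + a_1*x**(n-1) + ... + a_n and coeffs = [a_1, ..., a_n].
    A quotient whose denominator z_k is zero is reported as None."""
    a = [Fraction(c) for c in coeffs]
    n = len(a)
    # window = [z_{k-1}, z_{k-2}, ..., z_{k-n}]; start with z_0 = 1, z_k = 0 for k < 0
    window = [Fraction(1)] + [Fraction(0)] * (n - 1)
    quotients = []
    for _ in range(steps):
        z_next = -sum(a[j] * window[j] for j in range(n))    # recurrence (1-4)
        z_cur = window[0]
        quotients.append(z_next / z_cur if z_cur != 0 else None)
        window = [z_next] + window[:-1]
    return quotients


# p(x) = x**2 - 4x + 3 has the zeros 3 and 1; the quotients approach 3.
print([float(q) for q in bernoulli_quotients([-4, 3], 8)])
```

Only a finite number of quotients can ever be computed; the parameter `steps` is the point at which the infinite process is cut off.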

An algorithm is not well defined unless there can be no ambiguity
whatsoever about the operation to be performed next. Obvious causes 
for the breakdown of an algorithm are divisions by zero, or (in the real 
domain) square roots of negative numbers. More generally, an algorithm 
will always break down if a function is to be evaluated at a point where it is 
undefined. All such occurrences must be foreseen and avoided in the 
statement of the algorithm. 


EXAMPLES

3. Let an algorithm be defined as follows: Choose $z_0$ arbitrarily in the
interval (0, 2), and compute $z_1, z_2, \ldots$ by the recurrence relation

$z_{k+1} = \frac{1}{2 - z_k}$.

This algorithm is not well defined, since the formula does not make sense
when $z_k = 2$, which happens to be the case, for instance, if $z_0 = \frac{5}{4}$ and
$k = 3$. The algorithm can be turned into a well-defined one by a statement
such as the following: “Whenever $z_k = 2$, set $z_{k+1} = 13$.”
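
The breakdown in Example 3 can be reproduced with exact fractions. The sketch below assumes that the recurrence reads $z_{k+1} = 1/(2 - z_k)$ and starts from $z_0 = 5/4$, a value that reaches 2 after three steps under that assumption. The function name `iterate` and the `fallback` parameter are hypothetical.

```python
from fractions import Fraction


def iterate(z0, steps, fallback=None):
    """Apply z_{k+1} = 1/(2 - z_k) `steps` times, starting from z0.
    If some z_k equals 2, use `fallback` as z_{k+1}; with no fallback
    the algorithm is not well defined and an error is raised."""
    z = Fraction(z0)
    seq = [z]
    for _ in range(steps):
        if z == 2:
            if fallback is None:
                raise ZeroDivisionError(f"z_{len(seq) - 1} = 2: next step undefined")
            z = Fraction(fallback)
        else:
            z = 1 / (2 - z)
        seq.append(z)
    return seq


print(iterate(Fraction(5, 4), 3))                           # ends with z_3 = 2
print(iterate(Fraction(5, 4), 5, fallback=Fraction(1, 2)))  # continues past z_3
```

Supplying a fallback value is one way of stating in advance what is to be done in the exceptional case, which is what makes the algorithm well defined.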

4. Bernoulli’s algorithm described in §1.3 is well defined to the extent 
that the right-hand side of equation (1-4) always has a meaning and the 
sequence $\{z_k\}$ thus always exists. However, it still could be the case that
infinitely many $z_k$’s are zero. The corresponding elements $q_k$ would then
be undefined. (Consider, for example, the polynomial $p(z) = z^2 - 1$.)
Under suitable hypotheses on the polynomial (1-3) it can be shown,
however, that the quotients $q_k$ are always defined for $k$ sufficiently large
(see chapter 7).
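
The polynomial $p(z) = z^2 - 1$ mentioned in Example 4 can be checked directly. The following sketch only generates the sequence from recurrence (1-4) with $a_1 = 0$, $a_2 = -1$; the function name `bernoulli_z` is hypothetical.

```python
def bernoulli_z(coeffs, count):
    """z_0, ..., z_{count-1} from recurrence (1-4), coeffs = [a_1, ..., a_n]."""
    n = len(coeffs)
    window = [1] + [0] * (n - 1)        # z_0, z_{-1}, ..., z_{-n+1}
    zs = [1]
    for _ in range(count - 1):
        z_next = -sum(coeffs[j] * window[j] for j in range(n))
        zs.append(z_next)
        window = [z_next] + window[:-1]
    return zs


print(bernoulli_z([0, -1], 8))   # [1, 0, 1, 0, 1, 0, 1, 0]
```

Every second element vanishes, so every second quotient $z_{k+1}/z_k$ would require a division by zero.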


Many simple algorithms can be perfectly well described by means of the 
conventional symbolism of algebra, supplemented, if necessary, by English 
sentences. In this manner we shall for instance describe most of the 
algorithms given in this book. Ordinary algebraic language is not the 
only language, however, in which an algorithm can be expressed. As a
matter of fact, recent experience has shown that for very involved 
algorithms, especially those occurring in numerical linear algebra, the 
traditional mathematical language is sometimes grossly inadequate. 

Ordinary mathematical language has one further disadvantage: It 
cannot be understood directly by computing machines without first being
“coded.” In the interest of breaking down the communication barrier
between man and machine as well as between man and man, it becomes 
very desirable to describe algorithms in a language that can be understood 
by both.

An algorithmic language with this property, called FORTRAN
(= formula translator), was created around 1957 by the International 
Business Machines Corporation (IBM). It is very widely used not only 
on IBM machines, but on other machines as well. FORTRAN is a 
completely adequate tool (especially in view of some recent refinements) 
for describing and communicating to the machine the vast majority of 
algorithms currently used in computational practice. It is strongly 
suggested that the student of this book familiarize himself with FORTRAN 
and use it to solve the computational problems posed in subsequent 
chapters. Excellent introductions to FORTRAN are available 
(McCracken [1961]). 

FORTRAN was designed primarily as a practical research tool. Its 
great advantages are its simplicity and its wide circulation. From a 
theoretical point of view, though, the FORTRAN language has some 
serious shortcomings, and some of its rules appear highly artificial. 
Realizing this, an international team of computer scientists gathered in 
1957 with the aim of creating a universal algorithmic language that would 
be satisfactory from a theoretical point of view and that would not suffer 
from any artificial limitations. The result of their efforts was the language 
known as ALGOL (= algorithmic language). A revised version called
ALGOL 60 has found wide acceptance, especially in Europe. At the time 
this is being written, there are not yet many machines in the United 
States equipped to read ALGOL, although there are reasons to believe 
that this will change in the future. ALGOL is described in an official 
document of the Algol committee (Naur [1960]); there are also some less 
formal, but more readable, introductions to the language (McCracken 
[1962], Schwarz [1962]). 


1.5 Convergence and Stability 


Once an algorithm is properly formulated, we wish to know the exact 
conditions under which the algorithm yields the solution of the problem 
under consideration. If, as is most commonly the case, the algorithm 
results in the construction of a sequence of numbers, we wish to know the 
conditions for convergence of this sequence. The practitioner of the art 
of computation is frequently inclined to judge the performance of an 
algorithm in a purely pragmatic way: The algorithm has been tried out in 
a certain number of examples, and it has worked satisfactorily in 95 
per cent of all cases. 

Mathematicians tend to take a dim view of this type of scientific 
investigation (although it is basically the standard method of research in 
such vital disciplines as medicine and biology). It is indeed always
desirable to base a statement about the performance of a given algorithm
on logic rather than empirical evidence. Such logically provable state- 
ments are called theorems in mathematics. As we shall see, theorems 
about the convergence of algorithms can be stated in a good many cases. 


EXAMPLE 


5. The following is a necessary and sufficient condition for the conver- 
gence of Bernoulli’s method mentioned in §1.3: Among all zeros of 
maximum modulus of the polynomial (1-3) there exists exactly one zero of 
maximum multiplicity (see chapter 2 for an explanation of these concepts).
If this condition is fulfilled, the sequence $\{q_k\}$ converges to that
zero of maximum modulus and multiplicity. 


Once the question of convergence is settled, numerous other questions 
can be asked about the performance of an algorithm. One might want 
to know, for instance, how fast the algorithm converges. Or one may 
wish to know something about the size of the error, if the algorithm is 
artificially terminated after a finite number of steps. The latter question 
can be interpreted in two ways: Either one wishes to know how big the 
error is at most, or how big the error is approximately. The answer to the 
first question is given by an error bound; the answer to the second by an 
asymptotic formula. Mathematicians, who like to think in categories, 
usually give preference to error bounds. However, an error bound which 
exceeds the true error by a factor $10^6$ is practically useless, whereas an
approximate formula, while not representing a guaranteed bound, still 
can be very useful from a practical point of view. (Our scientists, if they 
depended on guaranteed error bounds, would never have dared to put a 
manned satellite in orbit.) Last but not least, the study of the asymptotic 
behavior frequently reveals information which enables one to speed up 
the convergence of the algorithm. Examples of this will occur in almost 
every chapter of this volume. 

Even the theoretical convergence of an algorithm does not always 
guarantee that it is practically useful. One more requirement must be 
met: The algorithm must be numerically stable. In high school we 
learned how to express the result of multiplying two six-digit decimal 
fractions (which in general has twelve digits) approximately in terms of a 
six-digit fraction by a process known as rounding. Although the resulting 
fraction is not the exact value of the product, we readily accept the minor 
inaccuracy in view of the greater manageability of the result. If several 
multiplications are to be performed in a row, rounding becomes a practical 
necessity, as it is impossible to handle an ever-increasing number of 
decimal places. In view of the fact that the individual errors due to
rounding are small, we usually assume that the accuracy of the final result
is not seriously affected by the individual rounding errors. 

Modern electronic digital computers, too, work with a limited number 
of decimal places. The number of arithmetic operations that can be 
performed per unit time, however, is about a million times as large as 
that performed in manual computation. Although the individual round- 
ing errors are still small, their cumulative effect can, in view of the large 
number of arithmetic operations performed, grow very rapidly like a 
cancer, and completely invalidate the final result. In order to be sound, 
an algorithm must remain immune to the accumulation of rounding errors. 
This immunity is called numerical stability. 
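
The accumulation of individually small rounding errors can be observed in a short experiment. The sketch below uses binary double-precision arithmetic, whereas the text speaks of decimal places, so it is an analogy rather than a reproduction. The helper name `running_sum` is hypothetical; `math.fsum` from the Python standard library serves as a reference because it compensates for the lost low-order digits.

```python
import math


def running_sum(values):
    """Add left to right; every intermediate result is rounded."""
    total = 0.0
    for v in values:
        total += v
    return total


# Compare the plainly accumulated sum with the compensated reference sum.
for n in (10, 1_000, 1_000_000):
    values = [0.1] * n
    drift = abs(running_sum(values) - math.fsum(values))
    print(n, drift)
```

Each single addition is wrong by at most half a unit in the last place, yet the discrepancy between the two sums typically grows as more operations are performed.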

The concept of numerical stability can be very well illustrated by means 
of Bernoulli’s algorithm discussed in §1.3. As mentioned above, this 
algorithm will, under certain conditions, yield the zero of maximum 
modulus of the polynomial (1-3). This algorithm is stable in the sense 
that it furnishes the zero to the same number of significant digits as are 
carried in the computation of elements of the sequence $\{z_k\}$. We shall
have the opportunity to discuss the following extension of the Bernoulli 
method: If we form the quantities 


$q_k' = \frac{q_{k+1} - q_k}{q_k - q_{k-1}} \, q_k,$


then under certain hypotheses (which can be specified) the sequence $\{q_k'\}$
tends to the zero of next smaller modulus of the polynomial $p$ (see chapter
8). We have thus again an instance of a convergent algorithm or, to put
it differently, of a logically impeccable mathematical theorem. In spite of
all this the algorithm just formulated is practically useless. The quantities
$q_k'$ are affected by rounding errors to an extent which completely spoils
convergence.
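
The contrast between theoretical convergence and practical behavior can be explored numerically. The sketch below assumes that the extension takes the form $q_k' = q_k (q_{k+1} - q_k)/(q_k - q_{k-1})$ and applies it to the illustrative polynomial $x^2 - 4x + 3$ (zeros 3 and 1), once in exact rational arithmetic and once in floating point. The function name `second_zero_estimates` and the `number` parameter are hypothetical.

```python
from fractions import Fraction


def second_zero_estimates(a1, a2, steps, number=Fraction):
    """q'_k for k = 1, ..., steps, for p(x) = x**2 + a1*x + a2, with all
    arithmetic carried out in the type `number` (Fraction or float).
    An entry is None where the denominator q_k - q_{k-1} vanishes.
    Assumes that no z_k is zero, which holds for the example below."""
    a1, a2 = number(a1), number(a2)
    z_prev, z_cur = number(0), number(1)          # z_{-1}, z_0
    q = []
    for _ in range(steps + 2):
        z_next = -a1 * z_cur - a2 * z_prev        # recurrence (1-4) with n = 2
        q.append(z_next / z_cur)                  # q_k = z_{k+1} / z_k
        z_prev, z_cur = z_cur, z_next
    estimates = []
    for k in range(1, len(q) - 1):
        denom = q[k] - q[k - 1]
        estimates.append(q[k] * (q[k + 1] - q[k]) / denom if denom != 0 else None)
    return estimates


exact = second_zero_estimates(-4, 3, 40)
rounded = second_zero_estimates(-4, 3, 40, number=float)
for k in (5, 15, 25, 35):
    print(k, float(exact[k - 1]), rounded[k - 1])
```

In exact arithmetic the estimates approach the zero 1. In floating point the differences $q_{k+1} - q_k$ eventually fall below the spacing of representable numbers near 3, so the later estimates are formed from rounding noise.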

In spite of the above example, the reader should not form the impression 
that stability is an invariant property which an algorithm either has or 
does not have. In the last analysis, the stability of an algorithm depends 
on the computing machine with which it is performed. The modification 
of Bernoulli’s algorithm described above is unstable if performed on an 
ordinary computing machine but it could be rendered stable on a machine 
equipped to compute with a variable number of decimal places. In a 
similar way it is possible to increase the stability of certain algorithms in 
numerical linear algebra by performing certain crucial operations with 
increased precision. 

Rather than trying to set up absolute standards of stability, it should 
be the aim of a theory of numerical stability to predict in a quantitative 
manner the extent of the influence of the accumulation of rounding errors
if certain hypotheses are made on the individual rounding errors. Such a
theory will be descriptive rather than categorical. Ideally, it will predict 
the outcome of a numerical experiment, much as physical theories predict 
the outcome of physical experiments. A particularly useful model of 
error propagation is obtained if the individual errors are treated as if they 
were statistically independent random variables. In chapter 16 we will 
show how this model can be applied to a large variety of algorithms 
discussed elsewhere in this volume. 


chapter 2 complex numbers and polynomials 


One of the recurring topics in the following chapters will be the solution of 
polynomial equations. Complex numbers are an indispensable tool for 
any serious work on this problem. 


2.1 Algebraic Definition 


The reader is undoubtedly aware of the fact that within the realm of 
real numbers the operations of addition, subtraction, multiplication, and 
division (except division by zero) can be carried out without restriction. 
However, the same is not true for the extraction of square roots. The 
equation 


(2-1) $x^2 = a$


is not always soluble. If $a < 0$, there is no real number $x$ which satisfies
equation (2-1), because the square of any real number is always 
nonnegative. 

Historically, complex numbers originated out of the desire to make 
equation (2-1) always solvable. This was achieved by simply postulating 
the existence of a solution also for negative values of a. Actually, only 
the solution of one special equation, namely the equation 


(2-2) $x^2 = -1$


has to be postulated. Following Euler, we denote a solution of this
equation by $i$. The symbol $i$ is called the imaginary unit; it is a “number”
satisfying

$i^2 = -1$.

Postulating further that $i$ can be treated as an ordinary number, we can
now solve any equation (2-1) with $a < 0$.


EXAMPLE

1. A solution of x? = —25 is given by x = 15, for 
(i5)? = i#5? = (—1)25 = —25. 

Another solution is x = —i5, for 


(—i5)? = (—1)?i25? = 1(—1)25 = —25. 


Not only special equations of the form (2-1), but any quadratic equation 
of the form 


(2-3)    $x^2 + 2bx + c = 0$


with real coefficients b and c can now be solved. As is well known, the 
method consists in writing the term on the left in the form of a complete 
square plus correction term: 

    $x^2 + 2bx + c = (x + b)^2 + c - b^2.$

Equation (2-3) demands that 

    $(x + b)^2 = b^2 - c.$

For $b^2 - c \geq 0$ we obtain the solutions familiar from elementary algebra. 
If $b^2 - c < 0$, then $c - b^2 > 0$, and our equation is equivalent to 

    $(x + b)^2 = -(c - b^2).$

According to the above, it has the “solutions” 

    $x + b = \pm i\sqrt{c - b^2}$

or 

    $x = -b \pm i\sqrt{c - b^2}.$


EXAMPLE 
2. The equation $x^2 + 6x + 25 = 0$ is equivalent to 

    $(x + 3)^2 + 25 - 9 = 0$

or 

    $(x + 3)^2 = -16.$

It therefore has the solutions 

    $x = -3 \pm i4.$
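The completing-the-square procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `solve_monic_quadratic` is hypothetical, Python's built-in `complex` type stands in for the numbers a + ib, and the equation is taken in the normalized form (2-3).

```python
import math

def solve_monic_quadratic(b: float, c: float) -> tuple[complex, complex]:
    """Solve x^2 + 2*b*x + c = 0 (real b, c) by completing the square."""
    d = b * b - c                          # (x + b)^2 = b^2 - c
    if d >= 0:
        s = math.sqrt(d)                   # real case of elementary algebra
        return complex(-b + s, 0.0), complex(-b - s, 0.0)
    s = math.sqrt(c - b * b)               # here c - b^2 > 0
    return complex(-b, s), complex(-b, -s)  # x = -b +/- i*sqrt(c - b^2)

# Example 2 of the text: x^2 + 6x + 25 = 0, i.e. b = 3, c = 25
roots = solve_monic_quadratic(3.0, 25.0)
```

Substituting each returned value back into the left-hand side of the equation is a cheap way to confirm the result.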


Any expression of the form a + ib, where a and b are real, is called a 
complex number and may henceforth be denoted by a single symbol such 
as z. If z = a + ib, we call a the real part of z and b the imaginary part 
of z. In symbols this relationship is expressed as follows: 

    $a = \operatorname{Re} z, \quad b = \operatorname{Im} z.$


Two complex numbers are considered equal if and only if both their real 
and their imaginary parts are equal. 

What has been gained by the introduction of complex numbers besides 
the ability of solving all quadratic equations in what at the moment may 
appear as a rather formalistic manner? Certainly nothing has been lost, 
because all real numbers are contained in the set of complex numbers. 
They are simply the complex numbers with imaginary part zero. Also, 
nothing has been spoiled, because we shall see that the ordinary rules of 
arithmetic still hold for complex numbers. 

To justify this statement, let 


    $z_1 = a + ib$
    $z_2 = c + id$

be any two complex numbers. We define 

(2-4)    $z_1 \pm z_2 = (a \pm c) + i(b \pm d).$

Two complex numbers are added (subtracted) by adding (subtracting) 
their real and imaginary parts. The role of the “neutral element,” i.e., 
of the number w satisfying z + w = z for all z, is played by the complex 
number† 

    $0 = 0 + i0.$


Multiplying two complex numbers formally, we obtain 
    $z_1 z_2 = (a + ib)(c + id) = ac + i(ad + bc) + i^2 bd.$

But $i^2 = -1$. We thus define: 

(2-5)    $z_1 z_2 = ac - bd + i(ad + bc).$


For the special case of real numbers (b = d = 0) the above definitions 
reduce to the ordinary sum and product of real numbers. Moreover, the 
following rules, familiar from ordinary arithmetic, still hold for any 
complex numbers $z_1, z_2, z_3$: 


The commutative laws 

    $z_1 + z_2 = z_2 + z_1, \quad z_1 z_2 = z_2 z_1,$

the associative laws 

    $(z_1 + z_2) + z_3 = z_1 + (z_2 + z_3), \quad z_1(z_2 z_3) = (z_1 z_2) z_3,$

and the distributive law 

    $z_1(z_2 + z_3) = z_1 z_2 + z_1 z_3.$


† Two different kinds of zeros are used in the equation 0 = 0 + i0. On the right, we have twice 
the real number zero. On the left, we have the complex number zero, whose real 
and imaginary parts are the real number zero. As no misunderstandings can arise, 
one does not try to make a graphical distinction between the two zeros. 


The proof of these relations is an immediate consequence of our definitions 
and of the corresponding rules for real numbers. 

We have yet to define the quotient $z_2/z_1$ of two complex numbers. The 
real case shows that some restriction has to be imposed here: The case 
$z_1 = 0$ will have to be excluded. Is it possible, however, to define $z_2/z_1$ 
for all $z_1 \neq 0$? To answer this question, we recall the meaning of the 
quotient. In the real case we mean by x = b/a the (unique) solution of the 
equation ax = b. Analogously, we understand by $z = z_2/z_1$ the solution 
of 

    $z_1 z = z_2.$

If z = x + iy, this amounts to 

    $(a + ib)(x + iy) = c + id,$

or, after separating real and imaginary parts, 

(2-6)    $ax - by = c,$
         $bx + ay = d.$

The two relations (2-6) can be regarded as a system of two linear equations 
for the two unknowns x and y. This system has a unique solution for 
arbitrary values of c and d if and only if its determinant is different from 
zero. The determinant is 

    $\begin{vmatrix} a & -b \\ b & a \end{vmatrix} = a^2 + b^2;$

it will be different from zero if and only if at least one of the numbers a 
and b is different from zero, that is, if $z_1 \neq 0$. If $z_1 \neq 0$, the solution of 
equation (2-6) is easily found to be 

    $x = \frac{ac + bd}{a^2 + b^2}, \quad y = \frac{ad - bc}{a^2 + b^2}.$

We thus define 

(2-7)    $\frac{z_2}{z_1} = \frac{ac + bd}{a^2 + b^2} + i\,\frac{ad - bc}{a^2 + b^2}.$



The same result could have been obtained by formal manipulation, as 
follows: Let 
    $z = \frac{z_2}{z_1} = \frac{c + id}{a + ib}.$

In the fraction on the right we multiply both numerator and denominator 
by a - ib. In the denominator we obtain 

    $(a + ib)(a - ib) = a^2 + b^2 + i(ba - ab) = a^2 + b^2,$

and carrying out the multiplication in the numerator we again find 
equation (2-7). This procedure cannot replace the proof that $z_1 z = z_2$ 
has the solution (2-7), because it assumes the existence of the solution; 
once existence has been established, however, it is very useful for the 
computation of the numerical value of $z_2/z_1$. 
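As an illustration, formula (2-7) can be transcribed directly into Python. The sketch below assumes the hypothetical helper name `divide` and uses the built-in `complex` type only as a container for real and imaginary parts; it makes no attempt to guard against overflow in $a^2 + b^2$, which a careful library routine would have to do.

```python
def divide(z2: complex, z1: complex) -> complex:
    """Return z2 / z1 by formula (2-7), with z1 = a + ib and z2 = c + id."""
    a, b = z1.real, z1.imag
    c, d = z2.real, z2.imag
    det = a * a + b * b                    # determinant of the system (2-6)
    if det == 0.0:
        raise ZeroDivisionError("the quotient is undefined for z1 = 0")
    return complex((a * c + b * d) / det, (a * d - b * c) / det)

# (1 + 2i) / (3 - 4i)
q = divide(complex(1, 2), complex(3, -4))
```

Multiplying `q` by the divisor should reproduce the dividend up to rounding, which is exactly the defining property $z_1 z = z_2$.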


EXAMPLES 


3. 


4. The reciprocal z = 1/(a + ib) of a complex number $a + ib \neq 0$ is 
calculated as follows: 

    $z = \frac{1}{a + ib} = \frac{a - ib}{(a + ib)(a - ib)} = \frac{a - ib}{a^2 + b^2} = \frac{a}{a^2 + b^2} - i\,\frac{b}{a^2 + b^2}.$


5. To express 


in the form a+ ib. Successive applications of the above method of 
reduction yield 








Problems 
1. Express in the form a + ib 
ra/3\ 2 
@ (APY, 
(b) (1 — a; 
(c) (cos + isin @)?; 
1 
cos p — ising’ 


r+7+8) 7, 1 
7+ 8 


(d) 





2. What are the values of z for which the complex number 


1 1 
i¢z’i+ipz 
is undefined ? 
3. Let $s = 1 + z + z^2 + \cdots + z^n$, where z is a complex number, $z \neq 1$. 
Find a closed formula for s by forming the expression s — zs. 


2.2 Geometrical Interpretation 


The discussion of §2.1 will have satisfied the student that complex 
numbers can be manipulated exactly like real numbers. At the moment 
this result may seem a mere formal curiosity. However, the significance 
of complex numbers in analysis is based on the fact that the algebraic 
operations discussed in §2.1 admit simple and beautiful geometric 
interpretations. 

A complex number z = a + ib can be represented geometrically in 
either of two ways: 

(i) We can associate with z the point with coordinates (a, b) in the 
(x, y)-plane. If the points of an (x, y)-plane are thought of as complex 
numbers, that plane is called the complex plane. Its x-axis is the locus of 
complex numbers that are real; it is therefore called the real axis. Its 
y-axis carries the points with real part zero; it is called the imaginary axis 
of the complex plane. 

(ii) We can associate with a complex number z the two-dimensional 
vector with components a and b. We think of this vector as a free vector, 
i.e., a vector that may be shifted around arbitrarily as long as its direction 
remains unchanged. 

Both interpretations can be combined into one by attaching the vector z 
to the origin of the plane. It then becomes the radius vector pointing 
from the origin to the point z. 


EXAMPLE 


6. To the four complex numbers z = 1, z = i, z = —1, z = —i there 
correspond points (Fig. 2.2c) or radius vectors (Fig. 2.2d). 


If complex numbers are interpreted as vectors, addition of two complex 
numbers amounts to ordinary addition of the corresponding vectors. 
For indeed, the sum of two vectors is obtained by forming the sum of 
corresponding components, just as in the case of complex numbers 
(see Fig. 2.2e). 

The difference $z_1 - z_2$ of two complex numbers can be interpreted as 
the difference of the corresponding vectors. As is well known, the 
difference of two vectors $z_1$ and $z_2$ can be defined by adding to $z_1$ the vector 
$-z_2$, i.e., the vector which has the same length as $z_2$ but direction opposite 
to $z_2$. If we attach both vectors $z_1$ and $z_2$ to the same point, then the 
difference $z_1 - z_2$ corresponds to the vector pointing from the head of $z_2$ 
to the head of $z_1$ (see Fig. 2.2f). 

Multiplication of a complex number by a real number can be similarly 
interpreted. Let c be a real number, and let z = a + ib. Writing 
c = c + i0, we obtain from the multiplication rule 

    $cz = (c + i0)(a + ib) = ca + icb.$

The vector cz thus has the same direction as z if c > 0, and it has the 
opposite direction if c < 0. If c = 0, cz is the zero vector. The length 
of cz is always |c| times the length of z. 


Figure 2.2 


The question of how to interpret the product of two arbitrary complex 
numbers now arises. In order to arrive at a suitable interpretation, we 
require the notions of the absolute value and of the argument of a complex 
number. 

The absolute value or modulus of a complex number z = a + ib is 
denoted by |z| and is defined as the length of the vector representing z, or 
equivalently, as the distance from the origin of the point in the complex 
plane representing z. It follows from Pythagoras’ theorem that 


(2-8)    $|z| = \sqrt{a^2 + b^2}.$


The absolute value of a complex number z is zero if and only if z = 0. 
Otherwise the absolute value is positive. 

The argument of a complex number is defined only when $z \neq 0$. It is 
denoted by arg z and denotes any angle subtended by the vector z and the 
positive real axis. The angle is counted positively if a counterclockwise 
rotation of the positive real axis would be required to make its direction 
coincide with the direction of the vector z (see Fig. 2.2g). 

The argument of a complex number is not uniquely determined. If $\varphi$ 
is a value of arg z, then every angle $\varphi + 2\pi k$ (k an arbitrary integer) is also 
a possible value of arg z. Thus, the argument of a complex number is 
determined only† up to multiples of $2\pi$. 

To compute an argument $\varphi$ of $z = a + ib \neq 0$ we proceed as follows. 
Letting |z| = r, we have from figure 2.2g 

(2-9)    $a = r\cos\varphi, \quad b = r\sin\varphi.$

We can use the first of these equations to determine the absolute value of $\varphi$ 
by means of the relations 

    $\cos\varphi = \frac{a}{r}, \quad |\varphi| \leq \pi.$

If $b \neq 0$, the sign of $\varphi$ is the same as the sign of b, as follows from the 
relations 

    $\operatorname{sign}\varphi = \operatorname{sign}(\sin\varphi) = \operatorname{sign}\frac{b}{r} = \operatorname{sign} b,$

since r > 0. If b = 0, then a = r or a = -r. In the former case, 
$\varphi = 0$. In the latter case, $\varphi = \pm\pi$. Since $\pi$ and $-\pi$ differ by $2\pi$, both 
values are equally admissible values of arg z. 
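The recipe just described (absolute value of the angle from the cosine, sign from the imaginary part) can be sketched as follows. The function name `argument` is hypothetical, and the clamping of a/r to the interval [-1, 1] is an added safeguard against rounding that is not part of the text.

```python
import math

def argument(a: float, b: float) -> float:
    """One admissible value of arg(a + ib), following (2-9)."""
    r = math.hypot(a, b)                   # |z| as in (2-8)
    if r == 0.0:
        raise ValueError("arg z is undefined for z = 0")
    cos_phi = max(-1.0, min(1.0, a / r))   # guard against rounding outside [-1, 1]
    phi = math.acos(cos_phi)               # 0 <= |phi| <= pi
    return -phi if b < 0 else phi          # sign of phi = sign of b

phi_deg = math.degrees(argument(4.0, 3.0))  # z = 4 + i3
```

For b = 0 and a < 0 this sketch returns $+\pi$, one of the two equally admissible values. In practice the library function `math.atan2(b, a)` delivers the same angle and is better conditioned near the real axis.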


† The more advanced development of the theory of complex numbers shows that it 
would be unwise to restrict the argument artificially to an interval of length $2\pi$ by 
stipulating a condition such as $-\pi < \varphi \leq \pi$ or $0 \leq \varphi < 2\pi$. 


EXAMPLES 

7. Let z = 4 + i3. From (2-8) 

    $|z| = \sqrt{4^2 + 3^2} = \sqrt{25} = 5.$

To compute $\varphi = \arg z$, we first use the relation 
$\cos\varphi = \frac{4}{5}$ to find $|\varphi| = 36^\circ\,52'$. 
Since Im z = 3 > 0, it follows that $\varphi = 36^\circ\,52'$. 

8. For z = -4 - i3 we likewise find |z| = 5, but in view of $\cos\varphi = -\frac{4}{5}$ 
we now have $|\varphi| = 143^\circ\,8'$ and, since Im z = -3 < 0, $\varphi = -143^\circ\,8'$. 

9. The absolute value of a complex number with imaginary part zero 
coincides with the absolute value (as defined for real numbers) of its real 
part. Its argument is 0 (or $2\pi k$), if the real part is positive, and $\pi$ 
(or $\pi + 2\pi k$), if the real part is negative. 


Relation (2-9) shows that every complex number $z \neq 0$ can be represented 
in a unique manner as the product of a real, positive number and of 
a complex number with absolute value 1. This representation is given by 

(2-10)    $z = r(\cos\varphi + i\sin\varphi),$

where r = |z|, and where $\varphi$ denotes any value of arg z. The representation 
(2-10) is called the polar representation or polar form of the complex 
number z. 

The polar representation enables us to give a geometric interpretation 
of the product and the quotient of two complex numbers. Let 


    $z_1 = r_1(\cos\varphi_1 + i\sin\varphi_1)$
    $z_2 = r_2(\cos\varphi_2 + i\sin\varphi_2)$

be any two nonzero complex numbers. Multiplying by the algebraic 
rules of §2.1 we find 

    $z_1 z_2 = r_1 r_2[\cos\varphi_1\cos\varphi_2 - \sin\varphi_1\sin\varphi_2 + i(\cos\varphi_1\sin\varphi_2 + \sin\varphi_1\cos\varphi_2)]$

or, by virtue of the addition theorems of the trigonometric functions, 

    $z_1 z_2 = r_1 r_2[\cos(\varphi_1 + \varphi_2) + i\sin(\varphi_1 + \varphi_2)].$

This, however, is the polar representation of the complex number with 
absolute value $r_1 r_2$ and argument $\varphi_1 + \varphi_2$. We thus have proved the 
relations 

(2-11)    $|z_1 z_2| = |z_1|\,|z_2|, \quad \arg(z_1 z_2) = \arg z_1 + \arg z_2.$

The second of these is to be interpreted in the following sense: The sum 
of any two admissible values of $\arg z_1$ and $\arg z_2$ yields an admissible 
value of $\arg(z_1 z_2)$. 
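The phrase "admissible value" can be made concrete with a small numerical experiment. The sketch below uses the standard-library functions `cmath.rect` and `cmath.polar`; the particular radii and angles are arbitrary choices made for illustration. Because `cmath.polar` always reports an argument in the interval from $-\pi$ to $\pi$, the sum of the two arguments and the reported argument of the product may differ by a multiple of $2\pi$.

```python
import cmath
import math

z1 = cmath.rect(2.0, math.radians(100.0))   # r1 = 2,   phi1 = 100 degrees
z2 = cmath.rect(1.5, math.radians(130.0))   # r2 = 1.5, phi2 = 130 degrees

r, phi = cmath.polar(z1 * z2)               # phi is reduced to (-pi, pi]
phi_sum = math.radians(100.0 + 130.0)       # 230 degrees, also admissible
k = round((phi_sum - phi) / (2 * math.pi))  # number of full turns between the two
```

Here the product has absolute value 3, and the two angles differ by exactly one full turn (k = 1), in agreement with (2-11).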


EXAMPLES 

10. Let $z_2$ be a positive real number. We then have $|z_2| = z_2$, $\varphi_2 = 0$. 
Multiplication of $z_1$ by $z_2$ amounts to a dilatation or contraction of the 
vector $z_1$, according to whether $z_2 > 1$ or $0 < z_2 < 1$. 

11. If $z_2$ is a complex number of absolute value 1, multiplication by $z_2$ 
amounts to a counterclockwise rotation by $\arg z_2$. Special case: 
Multiplication by i amounts to a rotation by $\pi/2$. 


It has already been noted that for real numbers complex multiplication 
reduces to ordinary real multiplication. Inasmuch as the argument of a 
negative real number may be taken as $\pm\pi$, the rules (2-11) can be thought 
of as a generalization of the rule “minus × minus = plus” of high school 
algebra. 

We now compute the quotient of two complex numbers given in polar 
form. Using (2-7) we find 


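    $\frac{z_2}{z_1} = \frac{r_1 r_2(\cos\varphi_1\cos\varphi_2 + \sin\varphi_1\sin\varphi_2)}{r_1^2} + i\,\frac{r_1 r_2(\cos\varphi_1\sin\varphi_2 - \cos\varphi_2\sin\varphi_1)}{r_1^2}$

    $\phantom{\frac{z_2}{z_1}} = \frac{r_2}{r_1}[\cos(\varphi_2 - \varphi_1) + i\sin(\varphi_2 - \varphi_1)].$

The expression on the right is the polar form of the complex number with 
absolute value $r_2/r_1$ and argument $\varphi_2 - \varphi_1$. We thus have the formulae 

(2-12)    $\left|\frac{z_2}{z_1}\right| = \frac{|z_2|}{|z_1|}, \quad \arg\left(\frac{z_2}{z_1}\right) = \arg z_2 - \arg z_1.$

An important special case arises when $z_2 = 1$. Since |1| = 1, arg 1 = 0 
we then have, writing z for $z_1$, 

(2-13)    $\left|\frac{1}{z}\right| = \frac{1}{|z|}, \quad \arg\left(\frac{1}{z}\right) = -\arg z.$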








The construction of the reciprocal 1/z for a given z thus involves two 
interchangeable steps: (i) Multiply z by $1/|z|^2$; (ii) Reverse the sign of the 
argument of the number thus obtained. The second of these steps 
(reversal of the sign of the argument) geometrically amounts to a reflection 
of the number on the real axis. As this operation of reflection occurs 
frequently in other contexts, too, a special notation has been devised for it. 
If z = a + ib is any complex number, reflection on the real axis yields the 
number a - ib. This number is called the complex conjugate of z and 
is usually denoted by $\bar{z}$. Evidently, 

(2-14)    $|\bar{z}| = |z|, \quad \arg\bar{z} = -\arg z.$


The real and imaginary parts of a complex number can be expressed in 
terms of a complex number and its conjugate. By adding and subtracting 
the relations z = a + ib, $\bar{z} = a - ib$ we obtain 

    $z + \bar{z} = 2a, \quad z - \bar{z} = 2ib,$

hence 

(2-15)    $a = \operatorname{Re} z = \frac{z + \bar{z}}{2}, \quad b = \operatorname{Im} z = \frac{z - \bar{z}}{2i}.$

From 

    $z\bar{z} = (a + ib)(a - ib) = a^2 + b^2 = |z|^2$

there follows the relation 

(2-16)    $|z| = \sqrt{z\bar{z}}.$

The following rules of computation are easily verified: 

(2-17)    $\overline{z_1 + z_2} = \bar{z}_1 + \bar{z}_2,$
          $\overline{z_1 z_2} = \bar{z}_1\bar{z}_2.$

These rules say, in effect, that in forming the complex conjugate of a sum 
or a product it is immaterial whether we take the complex conjugate of the 
result of the algebraic operation, or whether we perform the algebraic 
operation on the complex conjugate quantities. In the case of the 
reciprocal we have 

    $\overline{\left(\frac{1}{z}\right)} = \frac{1}{\bar{z}}.$


We shall use the rules just established to give a purely analytical proof 
of the so-called triangle inequalities. These inequalities say that in an 
arbitrary triangle the length of one edge is at most equal to the sum, and 
at least equal to the difference, of the lengths of the two other sides. If 
we represent the sides of the triangle by the vectors $z_1$, $z_2$, and $z_1 + z_2$ 
(see Fig. 2.2e), the triangle inequalities are given by 

(2-18)    $\bigl|\,|z_1| - |z_2|\,\bigr| \leq |z_1 + z_2| \leq |z_1| + |z_2|.$

To prove this analytically, we note that by (2-16) and (2-17) 

(2-19)    $|z_1 + z_2|^2 = (z_1 + z_2)\overline{(z_1 + z_2)}$
          $= (z_1 + z_2)(\bar{z}_1 + \bar{z}_2)$
          $= z_1\bar{z}_1 + z_1\bar{z}_2 + z_2\bar{z}_1 + z_2\bar{z}_2$
          $= |z_1|^2 + z_1\bar{z}_2 + z_2\bar{z}_1 + |z_2|^2.$

In view of $\overline{z_1\bar{z}_2} = \bar{z}_1\bar{\bar{z}}_2 = \bar{z}_1 z_2$ we have by (2-15) 

    $z_1\bar{z}_2 + z_2\bar{z}_1 = z_1\bar{z}_2 + \overline{z_1\bar{z}_2} = 2\operatorname{Re} z_1\bar{z}_2.$

Now for an arbitrary complex number z, 

    $|\operatorname{Re} z| \leq |z|.$

Setting $z = z_1\bar{z}_2$, we thus have by (2-11) and (2-14) 

    $|\operatorname{Re} z_1\bar{z}_2| \leq |z_1\bar{z}_2| = |z_1|\,|\bar{z}_2| = |z_1|\,|z_2|.$

Thus there follows from (2-19) 

    $|z_1 + z_2|^2 \leq |z_1|^2 + |z_2|^2 + 2|z_1|\,|z_2| = (|z_1| + |z_2|)^2$

and hence by taking square roots, 

    $|z_1 + z_2| \leq |z_1| + |z_2|.$

On the other hand, since $\operatorname{Re} z \geq -|z|$, we also have 

    $|z_1 + z_2|^2 \geq |z_1|^2 + |z_2|^2 - 2|z_1|\,|z_2| = (|z_1| - |z_2|)^2$

and hence, by taking square roots, 

    $|z_1 + z_2| \geq \bigl|\,|z_1| - |z_2|\,\bigr|.$


This completes the proof of (2-18). 
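A proof needs no numerical confirmation, but a quick experiment can help a reader see how the identity (2-19) and the inequalities (2-18) behave in floating-point arithmetic. The following spot check is a sketch only: the sample size, the range of the random numbers, and the tolerances are arbitrary assumptions.

```python
import random

random.seed(0)                              # fixed seed, so the run is reproducible

def rand_complex() -> complex:
    return complex(random.uniform(-10, 10), random.uniform(-10, 10))

violations = 0
for _ in range(1000):
    z1, z2 = rand_complex(), rand_complex()
    lower = abs(abs(z1) - abs(z2))
    upper = abs(z1) + abs(z2)
    tol = 1e-12 * (1.0 + upper)             # allowance for rounding errors
    # key identity (2-19): |z1 + z2|^2 = |z1|^2 + |z2|^2 + 2 Re(z1 * conj(z2))
    lhs = abs(z1 + z2) ** 2
    rhs = abs(z1) ** 2 + abs(z2) ** 2 + 2 * (z1 * z2.conjugate()).real
    if abs(lhs - rhs) > 1e-9 * (1.0 + rhs):
        violations += 1
    if not (lower - tol <= abs(z1 + z2) <= upper + tol):
        violations += 1
```

No violations are expected; a nonzero count would point to a tolerance that is too tight rather than to a failure of the theorem.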


Problems 


4. Compute the absolute value and all arguments of the complex numbers 


1+i 


(a) —i2, (b) —6 + #8, (c) 5 





5. Let $w = -\frac{1}{2} + i(\sqrt{3}/2)$. Show algebraically that $w^3 = 1$. What do 
you conclude about arg w? What about |w|? Verify your conclusions 
by calculating |w| and arg w. 

6. Show that the arguments $\varphi$ of the complex number z = a + ib ($a \neq 0$) 
are given by 

    $\varphi = \arctan\frac{b}{a} + 2\pi k$, if a > 0, 

    $\varphi = \arctan\frac{b}{a} + (2k + 1)\pi$, if a < 0, 

where k denotes an arbitrary integer, and where arctan x denotes the 
principal value (lying between $-\pi/2$ and $\pi/2$) of the arcus tangens function. 
What is arg z if a = 0? 

7. Using induction, prove that for arbitrary complex numbers $z_1, z_2, \ldots, z_n$ 

    $|z_1 + z_2 + \cdots + z_n| \leq |z_1| + |z_2| + \cdots + |z_n|.$


8. Determine the equation of the locus of those points z = x + iy of the 
complex plane with the following property: The ratio of the distances of 
z from the points +1 and —1 has the constant value k. (Hint: The 
distances can be expressed in the form |z - 1| and |z + 1|.) 

9. What is $i^n$ for arbitrary integral n? 


2.3 Powers and Roots 


Applying the multiplication formula (2-11) to a product of n equal 
factors $z = r(\cos\varphi + i\sin\varphi)$, we obtain 

(2-20)    $[r(\cos\varphi + i\sin\varphi)]^n = r^n(\cos n\varphi + i\sin n\varphi).$

Setting r = 1, there results Moivre’s formula 

(2-21)    $(\cos\varphi + i\sin\varphi)^n = \cos n\varphi + i\sin n\varphi.$

We expand the expression on the left by the binomial theorem and 
observe that $i^2 = -1$, $i^3 = -i$, $i^4 = 1, \ldots$. Equating real and imaginary 
parts on both sides of (2-21), we obtain 

    $\cos^n\varphi - \binom{n}{2}\cos^{n-2}\varphi\,\sin^2\varphi + \binom{n}{4}\cos^{n-4}\varphi\,\sin^4\varphi - \cdots = \cos n\varphi,$

    $\binom{n}{1}\cos^{n-1}\varphi\,\sin\varphi - \binom{n}{3}\cos^{n-3}\varphi\,\sin^3\varphi + \cdots = \sin n\varphi.$

Here the symbol $\binom{n}{k}$ denotes the binomial coefficient defined by 

    $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}, \quad k = 0, 1, \ldots, n.$


EXAMPLE 

12. For n = 3 we obtain 

    $\cos 3\varphi = \cos^3\varphi - \binom{3}{2}\cos\varphi\,\sin^2\varphi = \cos^3\varphi - 3\cos\varphi\,\sin^2\varphi,$

    $\sin 3\varphi = \binom{3}{1}\cos^2\varphi\,\sin\varphi - \binom{3}{3}\sin^3\varphi = 3\cos^2\varphi\,\sin\varphi - \sin^3\varphi.$
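The two expansions obtained from Moivre's formula lend themselves to a direct numerical check. In the sketch below the function name `cos_sin_multiple` is hypothetical; the sign pattern is taken from the powers of i listed above, and `math.comb` (available in Python 3.8 and later) supplies the binomial coefficients.

```python
import math

def cos_sin_multiple(n: int, phi: float) -> tuple[float, float]:
    """cos(n*phi) and sin(n*phi) from the binomial expansion of (cos phi + i sin phi)^n."""
    c, s = math.cos(phi), math.sin(phi)
    re = im = 0.0
    for k in range(n + 1):
        term = math.comb(n, k) * c ** (n - k) * s ** k
        sign = (-1) ** (k // 2)            # i^k = 1, i, -1, -i, 1, ...
        if k % 2 == 0:
            re += sign * term              # even k feeds the real part
        else:
            im += sign * term              # odd k feeds the imaginary part
    return re, im

c3, s3 = cos_sin_multiple(3, 0.7)          # should agree with cos(2.1), sin(2.1)
```

For large n this expansion is not a sensible way to evaluate $\cos n\varphi$, since the alternating terms can cancel badly; it is shown here only to illustrate the identities.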


We are now ready to tackle the problem of computing the nth roots of a 
complex number. In the real domain we call nth root of a real number a 
any number x satisfying the equation 

    $x^n = a.$

Analogously we shall call nth root of the complex number w any number 
z = x + iy satisfying the equation 

(2-22)    $z^n = w.$

If w = 0 this clearly has the only solution z = 0. In order to determine 
all possible solutions z for $w \neq 0$, we write 

    $w = \rho(\cos\alpha + i\sin\alpha)$

and seek to determine the polar form $z = r(\cos\varphi + i\sin\varphi)$. By (2-20), 

    $z^n = r^n(\cos n\varphi + i\sin n\varphi),$

and condition (2-22) assumes the form 

(2-23)    $r^n(\cos n\varphi + i\sin n\varphi) = \rho(\cos\alpha + i\sin\alpha).$

This condition evidently will be satisfied if 

    $r^n = \rho$, i.e., $r = \sqrt[n]{\rho}$ 

and 

    $n\varphi = \alpha$, i.e., $\varphi = \frac{\alpha}{n}$. 

Thus the number 

    $z = \sqrt[n]{\rho}\left(\cos\frac{\alpha}{n} + i\sin\frac{\alpha}{n}\right)$

certainly is an nth root of the complex number $\rho(\cos\alpha + i\sin\alpha)$. Is it the 
only one? The absolute value of a nonzero complex number being 
positive, $\sqrt[n]{\rho}$ evidently is the only possible absolute value of z. However, 
other values of the argument are possible. We recall that the argument of 
a complex number is determined only up to multiples of $2\pi$. Thus 
condition (2-23) is also satisfied if 

    $n\varphi = \alpha + 2\pi k,$

where k is an arbitrary integer. We thus obtain an infinity of possible 
values of $\varphi$: 

    $\varphi = \frac{\alpha}{n} + \frac{2\pi k}{n}, \quad k = 0, \pm 1, \pm 2, \ldots.$

Not all these values yield different solutions of (2-23), however. The 
same solution z is defined by two values of $\varphi$ that differ merely by an 
integral multiple of $2\pi$. This will occur as soon as the corresponding 
values of k differ by an integral multiple of n. Different solutions of 
(2-23) thus are obtained only by selecting the values k = 0, 1, ..., n - 1. 
We summarize this result as follows: 


Theorem 2.3 The equation $z^n = w$, where $w \neq 0$ and n is a positive 
integer, has exactly n solutions. They are given by 

    $z = r(\cos\varphi + i\sin\varphi),$

where 

    $r = \sqrt[n]{|w|}$ and $\varphi = \frac{1}{n}\arg w + \frac{2\pi k}{n}, \quad k = 0, 1, 2, \ldots, n - 1.$


Geometrically, these solutions all lie on a circle of radius r. The 
argument of one solution is 1/n times the argument (any argument) of w; 
the remaining solutions divide the circumference of the circle in n equal 
parts. 
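Theorem 2.3 translates almost word for word into a short routine. The sketch below assumes the hypothetical name `nth_roots` and relies on `cmath.polar` and `cmath.rect` from the Python standard library; `cmath.polar` supplies one admissible value of arg w, which by the theorem is all that is needed.

```python
import cmath
import math

def nth_roots(w: complex, n: int) -> list[complex]:
    """All n solutions of z**n = w for w != 0, following Theorem 2.3."""
    if w == 0 or n < 1:
        raise ValueError("need w != 0 and n >= 1")
    rho, alpha = cmath.polar(w)            # |w| and one admissible value of arg w
    r = rho ** (1.0 / n)                   # the positive real nth root of |w|
    return [cmath.rect(r, (alpha + 2 * math.pi * k) / n) for k in range(n)]

roots = nth_roots(1j, 4)                   # the four solutions of z^4 = i
```

The first entry has argument $\pi/8$, and each following entry is obtained from its predecessor by a rotation through a quarter turn.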


EXAMPLES 

13. To compute all solutions of $z^4 = i$. Since |i| = 1, all solutions have 
absolute value one. In view of $\arg i = \pi/2$, one solution has the argument 
$\pi/8$; the remaining solutions divide the unit circle into four equal parts 
(see Fig. 2.3). 

14. In the real domain the number 1 has either one or two nth roots, 
according to whether n is odd or even. In the complex domain the 
number 1 = 1 + i0 must have n nth roots. Since $\sqrt[n]{1} = 1$, all roots 
have absolute value 1. Since arg 1 = 0, the arguments of these n nth 
roots of unity are given by $2\pi k/n$, k = 0, 1, ..., n - 1. Special example: 
For n = 3 we obtain the following three third roots of unity: 

    $\cos 0 + i\sin 0 = 1,$

    $\cos\frac{2\pi}{3} + i\sin\frac{2\pi}{3} = -\frac{1}{2} + i\,\frac{\sqrt{3}}{2},$

    $\cos\frac{4\pi}{3} + i\sin\frac{4\pi}{3} = -\frac{1}{2} - i\,\frac{\sqrt{3}}{2}.$


Figure 2.3 


For n even one root of unity (namely the one obtained for k = n/2) has the 
argument 

    $\frac{2\pi}{n}\cdot\frac{n}{2} = \pi$

and thus is real and negative, i.e., we obtain two real nth roots of unity. 
For n odd the equation 

    $\frac{2\pi k}{n} = \pi$

is not satisfied for any integer k, thus there is no negative root of unity. 
The example shows how by introducing complex numbers cumbersome 
distinctions of special cases can be avoided or dealt with from a unified 
point of view. 


Problems 


10. Determine all solutions of the equation 
P+z4+27%74---427=0. 
(Use problem 3.) 

11. Find closed expressions for the sums 

    $A_n = 1 + r\cos\varphi + r^2\cos 2\varphi + \cdots + r^n\cos n\varphi,$

    $B_n = r\sin\varphi + r^2\sin 2\varphi + \cdots + r^n\sin n\varphi.$

(Hint: Let $z = r(\cos\varphi + i\sin\varphi)$ and consider $A_n + iB_n$.) 
12. Determine all solutions of the equations 
(a) z*= —-4,  (b) 22 =i, (c) 22 = 3V3. 


13. Express the function $\cos 4\varphi$ as a polynomial in $\cos\varphi$. 


14. Express the two values of $\sqrt{x + iy}$ in the form a + ib. (Hint: Square 
and compare real and imaginary parts.) 


2.4 The Complex Exponential Function 


Beginning with this section we shall study certain important functions of 
a complex variable. In the real domain, a function is defined if to every 
real number of a certain set of real numbers (called the domain of the 
function) there is made to correspond (for instance, by an algebraic 
formula) a number of another set (called the range of the function). A 
complex function is defined in the same manner, with the difference that 
now both domain and range of the function are, in general, sets of complex 
numbers. 

We begin by defining the exponential function for complex values of 
the variable. In the real domain the exponential function can be defined 
by the exponential series, 


    $e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots.$

Replacing x by iy, where y is real, we obtain formally 

    $e^{iy} = 1 + \frac{iy}{1!} + \frac{(iy)^2}{2!} + \frac{(iy)^3}{3!} + \frac{(iy)^4}{4!} + \cdots.$

Using the fact that $i^2 = -1$, $i^3 = -i$, $i^4 = 1, \ldots$, this may be written 

    $e^{iy} = \left(1 - \frac{y^2}{2!} + \frac{y^4}{4!} - \cdots\right) + i\left(\frac{y}{1!} - \frac{y^3}{3!} + \frac{y^5}{5!} - \cdots\right).$

The two series on the right are recognized as the Maclaurin expansions of 
cos y and sin y. Thus we obtain 

(2-24)    $e^{iy} = \cos y + i\sin y.$



The above computations are purely formal in the sense that they lack 
analytic justification. (It has not been defined, for instance, what is 
meant by the sum of an infinite series with complex terms, nor do we know 
whether it is permissible to rearrange the terms in such a series.) How- 
ever, nothing keeps us from adopting equation (2-24) as the definition of 
the exponential function $e^z$ for purely imaginary values of z. It will soon 
be obvious that this definition is reasonable for a number of reasons. 

We notice, first of all, that the polar form of a complex number $z \neq 0$, 
where r = |z| and $\varphi = \arg z$, can be written thus: 

    $z = re^{i\varphi}.$

We also note the following properties of the function $e^{iy}$, which are 
analogous to properties of the real exponential function: 

(a) $e^{i0} = \cos 0 + i\sin 0 = 1 + i0 = 1.$ 

For the one number which is real as well as purely imaginary, the new 
definition (2-24) is compatible with the real definition. 

(b) The real exponential function satisfies the so-called addition 
theorem 

    $e^{x_1}e^{x_2} = e^{x_1 + x_2}$

for arbitrary real numbers $x_1$ and $x_2$. In analogy, the following identity 
holds for arbitrary real $y_1$ and $y_2$: 

(2-25)    $e^{iy_1}e^{iy_2} = e^{i(y_1 + y_2)}.$

The proof follows from the multiplication rule (2-11) in view of the fact 
that $e^{iy}$ is a complex number with absolute value 1 and argument y. 

The following two properties of the complex exponential function have 
no direct analog in the realm of real numbers. 

(c) We have $e^{2\pi i} = \cos 2\pi + i\sin 2\pi$ and hence 

    $e^{2\pi i} = 1.$

(This formula connects several important numbers of analysis.) 

(d) It follows from (2-25) and (c) that 

    $e^{i(y + 2\pi)} = e^{iy}e^{2\pi i} = e^{iy}$

for arbitrary y. This relation expresses the fact that the exponential 
function is periodic with period $2\pi i$. 

Above, we have defined $e^{iy}$ in terms of cos y and sin y. We now shall 
show how to express the trigonometric functions in terms of the exponential 
function. Replacing in (2-24) y by -y, we get 

(2-26)    $e^{-iy} = \cos(-y) + i\sin(-y) = \cos y - i\sin y.$

Adding and subtracting (2-24) and (2-26) and solving for cos y and sin y, 
we obtain Euler’s formulas 

(2-27)    $\cos y = \frac{e^{iy} + e^{-iy}}{2}, \quad \sin y = \frac{e^{iy} - e^{-iy}}{2i}.$

EXAMPLE 

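15. Moivre’s formulas enable us to express $\cos n\varphi$ and $\sin n\varphi$ in terms of 
powers of $\cos\varphi$ and $\sin\varphi$. Here we are concerned with the converse 
problem: To express a power of $\cos\varphi$ or $\sin\varphi$ as a linear combination of 
the functions $\cos\varphi, \cos 2\varphi, \ldots, \sin\varphi, \sin 2\varphi, \ldots$. We explain the method 
by considering the function $\cos^4\varphi$. By Euler’s formula, 

    $\cos^4\varphi = \left(\frac{e^{i\varphi} + e^{-i\varphi}}{2}\right)^4$

or, using the binomial theorem, since $(e^{i\varphi})^n = e^{in\varphi}$, 

    $(\cos\varphi)^4 = \frac{1}{16}(e^{i4\varphi} + 4e^{i2\varphi} + 6 + 4e^{-i2\varphi} + e^{-i4\varphi}).$

Collecting the terms with the same absolute value of the exponents and 
applying Euler’s formula again, we find 

    $(\cos\varphi)^4 = \frac{1}{16}(e^{i4\varphi} + e^{-i4\varphi}) + \frac{1}{4}(e^{i2\varphi} + e^{-i2\varphi}) + \frac{3}{8}$
    $\phantom{(\cos\varphi)^4} = \frac{1}{8}\cos 4\varphi + \frac{1}{2}\cos 2\varphi + \frac{3}{8}.$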
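The method of this example works for any power of the cosine, and the bookkeeping can be left to a few lines of code. The sketch below is an illustration under stated assumptions: the function name `cos_power_coefficients` is hypothetical, and exact rational arithmetic from the standard `fractions` module is used so that the coefficients appear as fractions rather than rounded decimals.

```python
import math
from fractions import Fraction

def cos_power_coefficients(n: int) -> dict[int, Fraction]:
    """Coefficients c_m with cos(phi)**n == sum of c_m * cos(m * phi)."""
    coeffs: dict[int, Fraction] = {}
    for k in range(n + 1):
        # binomial term C(n, k) / 2^n multiplies e^{i(n - 2k) phi};
        # the exponents +m and -m pair up into a single cosine cos(m phi)
        m = abs(n - 2 * k)
        coeffs[m] = coeffs.get(m, Fraction(0)) + Fraction(math.comb(n, k), 2 ** n)
    return coeffs

c4 = cos_power_coefficients(4)             # {4: 1/8, 2: 1/2, 0: 3/8}
```

The same pairing of exponents handles odd powers, where no constant term arises.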


There remains the problem of defining $e^z$ for an arbitrary complex 
number z = x + iy. We adopt the following definition: 

(2-28)    $e^{x + iy} = e^x e^{iy} = e^x(\cos y + i\sin y).$

With this definition, the addition theorem 

(2-29)    $e^{z_1}e^{z_2} = e^{z_1 + z_2}$

holds for arbitrary complex numbers $z_1 = x_1 + iy_1$ and $z_2 = x_2 + iy_2$. 


Proof: 

    $e^{z_1}e^{z_2} = e^{x_1}e^{x_2}e^{iy_1}e^{iy_2}$    (by (2-28)) 
    $= e^{x_1 + x_2}e^{i(y_1 + y_2)}$    (by (2-25)) 
    $= e^{x_1 + x_2 + i(y_1 + y_2)}$    (by (2-28)) 
    $= e^{z_1 + z_2}.$

As a consequence of the addition theorem we find that for any positive 
integer n 

    $(e^z)^n = e^{nz};$

furthermore, setting $z_1 = z$ and $z_2 = -z$ and observing property (a) we 
obtain 

    $e^z e^{-z} = e^{z - z} = e^0 = 1.$

Consequently the relation 

    $e^{-z} = \frac{1}{e^z},$

familiar for real values of z, holds true also for arbitrary complex values. 
It shows, among other things, that the complex exponential function 
never assumes the value zero. 
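Definition (2-28) reduces the complex exponential to three real function evaluations, and its consequences can be spot-checked numerically. In the sketch below the function name `cexp` and the two sample points are arbitrary choices made for illustration; the library routine `cmath.exp` serves only as an independent reference.

```python
import cmath
import math

def cexp(z: complex) -> complex:
    """e^z for z = x + iy via definition (2-28): e^x (cos y + i sin y)."""
    ex = math.exp(z.real)
    return complex(ex * math.cos(z.imag), ex * math.sin(z.imag))

z1, z2 = complex(0.3, 2.0), complex(-1.1, 0.5)

err_reference = abs(cexp(z1) - cmath.exp(z1))                  # agreement with cmath
err_addition = abs(cexp(z1) * cexp(z2) - cexp(z1 + z2))        # addition theorem (2-29)
err_period = abs(cexp(z1 + 2j * math.pi) - cexp(z1))           # period 2*pi*i
err_reciprocal = abs(cexp(z1) * cexp(-z1) - 1.0)               # e^z * e^(-z) = 1
```

All four discrepancies should be of the order of the rounding unit; none is exactly zero in general, since sine, cosine and the real exponential are themselves only computed approximately.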


Problems 


15. Express sin* p cos? ¢ in terms of the functions cos 9, cos 29,.... 
16. Prove that 

    $\int_0^{2\pi} (\cos\varphi)^{2n}\,d\varphi = 2\pi\,\frac{\binom{2n}{n}}{2^{2n}}.$

17. The logarithm y = log x of a real, positive number x is defined as the 
unique solution of the equation $e^y = x$. How would you define the 
logarithm of a complex number w? Does your definition lead to a unique 
value of log w? What are the possible value(s) of log (-1)? 

18. If t is regarded as the time, the point $z = z(t) = Re^{it}$ travels on a circle of 
radius R. What is the locus of the points 


(a) z = ae’ + be“ (a, b real); 


(b) Zz 


cos 2 + icost (O0Sts7)? 


19. Prove that $\overline{e^z} = e^{\bar{z}}$ for all complex z. 


2.5 Polynomials 


Let $a_0, a_1, a_2, \ldots, a_n$ be n + 1 arbitrary complex numbers, $a_0 \neq 0$. 
The complex-valued function p defined by 

(2-30)    $p(z) = a_0 z^n + a_1 z^{n-1} + a_2 z^{n-2} + \cdots + a_n$

is called a polynomial of degree n. The constants $a_0, \ldots, a_n$ are called the 
coefficients of the polynomial. The number $a_0$ is called the leading 
coefficient. A polynomial is called real if all its coefficients are real. 

Any real or complex number z for which 


p(z) = 0 


is called a zero of the polynomial p. We know that a real polynomial 
does not necessarily have real zeros; it suffices to recall the case of poly- 
nomials of degree 2. In the complex case, however, the situation is 
different. 


Theorem 2.5a Every polynomial of degree $n \geq 1$ has at least one 
zero. 


This is the so-called fundamental theorem of algebra, which was first 
stated and proved (in five different ways) by C. F. Gauss (1777-1855). 
The proofs of Gauss were either algebraically involved, or they used 
advanced methods. Today the modern tools of analysis make it possible 
to give a relatively simple proof that is accessible to any student of 
advanced calculus (see e.g., Landau [1951], p. 233). However, a presenta- 
tion of this proof would lead us too far astray from the main themes of 
this book and it is therefore omitted. 

Taking the fundamental theorem for granted, we now discuss several of 
its applications. 


Theorem 2.5b Let p be a polynomial of degree $n \geq 1$, and let 
$p(z_1) = 0$. Then there exists a polynomial $p_1$ of degree n - 1 such 
that 

    $p(z) = (z - z_1)p_1(z)$

identically in z. 


Proof. Let the polynomial p be given by (2-30). In view of $p(z_1) = 0$ we 
have 

    $p(z) = p(z) - p(z_1)$
    $= a_0(z^n - z_1^n) + a_1(z^{n-1} - z_1^{n-1}) + \cdots + a_{n-1}(z - z_1).$

We now make use of the identities 

    $z^k - z_1^k = (z - z_1)(z^{k-1} + z^{k-2}z_1 + \cdots + z z_1^{k-2} + z_1^{k-1})$

(k = 1, 2, ..., n). Factoring out $z - z_1$, we obtain 

    $p(z) = (z - z_1)[a_0(z^{n-1} + z^{n-2}z_1 + \cdots + z z_1^{n-2} + z_1^{n-1})$
    $+ a_1(z^{n-2} + z^{n-3}z_1 + \cdots + z_1^{n-2})$
    $+ \cdots$
    $+ a_{n-2}(z + z_1) + a_{n-1}].$

Calling the expression in brackets $p_1(z)$, we find upon rearranging that 

    $p_1(z) = a_0 z^{n-1} + (a_0 z_1 + a_1)z^{n-2} + \cdots$
    $+ (a_0 z_1^{n-1} + a_1 z_1^{n-2} + \cdots + a_{n-1});$

$p_1$ thus is again a polynomial; since $a_0 \neq 0$, the degree of $p_1$ is n - 1. 
This completes the proof of theorem 2.5b. 
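The coefficients of $p_1$ found in the proof obey a simple recurrence: writing $b_0, b_1, \ldots, b_{n-1}$ for them, one has $b_0 = a_0$ and $b_k = b_{k-1}z_1 + a_k$. The sketch below, with the hypothetical name `deflate`, carries out this recurrence; it is offered as an illustration of the theorem, not as a statement about how the book later organizes such computations. Carried one step further, the same recurrence yields the value $p(z_1)$, which vanishes when $z_1$ is a zero.

```python
def deflate(a: list[complex], z1: complex) -> tuple[list[complex], complex]:
    """Divide p(z) = a[0] z^n + ... + a[n] by (z - z1).

    Returns the coefficients of p1 (degree n - 1, leading coefficient first)
    and the remainder, which equals p(z1).
    """
    b = [a[0]]                             # b_0 = a_0
    for coeff in a[1:]:
        b.append(b[-1] * z1 + coeff)       # b_k = b_{k-1} * z1 + a_k
    return b[:-1], b[-1]

# p(z) = z^2 + 6z + 25 has the zero z1 = -3 + 4i (Example 2 of section 2.1)
p1, remainder = deflate([1, 6, 25], complex(-3, 4))
```

Here `p1` represents the linear polynomial z + (3 + 4i), whose zero -3 - 4i is the second solution found in that example.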

Let now p be a polynomial of degree n > 1. By the fundamental 
theorem, it has at least one zero, $z_1$ say. By theorem 2.5b, p can thus be 
represented in the form $(z - z_1)p_1(z)$, where $p_1$ is of degree $n - 1 \geq 1$. 
Again by the fundamental theorem, $p_1$ has at least one zero, say $z_2$, and 
thus is representable in the form $(z - z_2)p_2(z)$, where $p_2$ is a polynomial of 
degree n - 2. The process evidently can be continued until we arrive at 
a polynomial $p_n$ of degree zero. Since the leading coefficient of every $p_k$ 
is $a_0$, it follows that $p_n(z) = a_0 \neq 0$. We thus find the following 
representation for the polynomial p: 

(2-31)    $p(z) = (z - z_1)(z - z_2) \cdots (z - z_n)\,a_0.$

We thus have represented p(z) as a product of polynomials of degree 1, or 
linear polynomials, of the form $z - z_k$. The numbers $z_k$ (k = 1, 2, ..., n) 
evidently all are zeros of p. Any number different from all the $z_k$'s cannot 
be a zero, since a product of nonzero complex numbers never vanishes. 
It is not asserted that the numbers $z_k$ are all different. However, we have 
proved the following: A polynomial of degree $n \geq 1$ has, at most, n distinct 
zeros. 

An important consequence of this statement is as follows: If two
polynomials p and q, both known to have degrees $\le n$, assume the same
values at n + 1 points, then they are identical. For the proof, consider
the polynomial p - q. Its degree is at most n, and yet its value is zero at
the n + 1 points where p and q agree. This can only be if p - q vanishes
identically.

Above, we have in the relation (2-31) represented a polynomial of degree
n as a product of linear factors. It is conceivable that this representation
depends on the order in which the zeros $z_1, z_2, \ldots, z_n$ have been split off.
However, we now shall show that for a given polynomial p the representa- 
tion (2-31) is unique (apart, obviously, from the order in which the factors 
appear). 

Our assertion is trivial if p has n distinct zeros, for in this case each 
zero must appear in the representation (2-31), and since each must occur, 
each can occur only once. It is quite possible, however, that a
polynomial of degree n > 1 has fewer than n distinct zeros, and that as a
consequence the numbers $z_1, z_2, \ldots, z_n$ are not all different from each other.

EXAMPLE
16. The polynomial $p(z) = z^4 - 4z^3 + 6z^2 - 4z + 1$ has the
representation


$$p(z) = (z - 1)^4 = (z - 1)(z - 1)(z - 1)(z - 1).$$


There is only one zero, z = 1; in the representation (2-31) we have

$$z_1 = z_2 = z_3 = z_4 = 1.$$


In order to prove the uniqueness of the representation (2-31) even in the
case of repeated zeros, assume p has another representation of the same
form, say

$$p(z) = (z - z'_1)(z - z'_2) \cdots (z - z'_n)\,b_0. \tag{2-32}$$

From equations (2-31) and (2-32) we obtain the identity

$$(z - z_1)(z - z_2) \cdots (z - z_n)\,a_0 = (z - z'_1)(z - z'_2) \cdots (z - z'_n)\,b_0.$$

The expression on the left is zero for $z = z_1$. Hence the expression on the
right, too, must vanish when $z = z_1$. This is possible only if one of the
numbers $z'_i$, say $z'_1$, equals $z_1$. Cancelling the factor $z - z_1$, we obtain

$$(z - z_2) \cdots (z - z_n)\,a_0 = (z - z'_2) \cdots (z - z'_n)\,b_0.$$

Because we have divided by $z - z_1$, this identity is at the moment proved
for $z \ne z_1$ only; however, since both sides are polynomials (of degree
n - 1), and since the set of all points $\ne z_1$ comprises more than n - 1
points, the identity must in fact hold for all z. Now we can continue as
above: One of the numbers $z'_2, \ldots, z'_n$, say $z'_2$, must be equal to $z_2$ (even
though $z_2$ could be the same as $z_1$). Splitting off the factor $z - z_2$
and continuing in the same vein, we find that for a suitable numbering
of the $z'_i$,

$$z'_1 = z_1, \quad z'_2 = z_2, \quad \ldots, \quad z'_n = z_n, \quad b_0 = a_0.$$

We thus have proved:


Theorem 2.5c A polynomial $p(z) = a_0 z^n + a_1 z^{n-1} + \cdots + a_n$ of
degree n can be represented in a unique manner (up to the order of
the factors) in the form

$$p(z) = (z - z_1)(z - z_2) \cdots (z - z_n)\,a_0.$$

Each zero of p occurs at least once among the numbers $z_1, z_2, \ldots, z_n$.


If a zero z of p occurs precisely k times among the numbers $z_1, z_2, \ldots, z_n$,
the zero is said to have multiplicity k. A zero of multiplicity one is also
called a simple zero. The zeros alone do not completely determine a
polynomial, not even up to a constant factor, but the zeros together with
their multiplicities do.


EXAMPLE
17. The two third degree polynomials

$$p(z) = (z - 1)(z + 1)^2 = z^3 + z^2 - z - 1$$
$$q(z) = (z - 1)^2(z + 1) = z^3 - z^2 - z + 1$$

both have the zeros z = 1 and z = -1, and no other zeros. In the case
of p, z = 1 has multiplicity 1 and z = -1 has multiplicity 2. In the case
of q the multiplicities are reversed.


Multiplying out the factors in (2-31) and comparing the coefficients
with those of (2-30) we obtain relations between the numbers $z_k$ and the
coefficients of the polynomial. These relations are known as Vieta's
formulas. The simplest of these formulas are those obtained by
comparing the coefficients of $z^{n-1}$ and by comparing the constant terms:

$$z_1 z_2 \cdots z_n = (-1)^n \frac{a_n}{a_0}, \tag{2-33}$$

$$z_1 + z_2 + \cdots + z_n = -\frac{a_1}{a_0}. \tag{2-34}$$
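
Vieta's formulas (2-33) and (2-34) can be checked numerically by multiplying out the linear factors of (2-31). The sketch below does this one factor at a time; the function name `expand` and the coefficient-list representation (descending powers) are assumptions made for the illustration.

```python
from math import prod


def expand(a0, zeros):
    """Coefficients [a0, a1, ..., an] of a0*(z - z1)*(z - z2)*...*(z - zn)."""
    coeffs = [a0]
    for zk in zeros:
        # Multiplying by (z - zk): new c_j = old c_j - zk * old c_(j-1).
        coeffs = [c - zk * prev for c, prev in zip(coeffs + [0], [0] + coeffs)]
    return coeffs


zeros = [1, -1, -1]           # zero 1 (simple) and zero -1 (multiplicity 2)
coeffs = expand(2, zeros)     # 2*(z - 1)*(z + 1)^2
n = len(zeros)

print(coeffs)                                             # [2, 2, -2, -2]
print(prod(zeros), (-1) ** n * coeffs[-1] / coeffs[0])    # both sides of (2-33)
print(sum(zeros), -coeffs[1] / coeffs[0])                 # both sides of (2-34)
```

The two numbers printed on each of the last two lines agree, as the formulas require.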


The fundamental theorem of algebra guarantees the existence of at 
least one zero of any polynomial of positive degree, and thus indirectly of 
the representation (2-31) in terms of linear factors. A completely 
different problem is the actual computation of these zeros. As pointed 
out already in §1.3, explicit formulas (involving root operations) can be 
given only if the degree does not exceed four, and are frequently impractical 
already when the degree is three. For polynomials of degree > 4 the zeros 
can be found, in general, only by algorithmic techniques. However, for 
some special polynomials of arbitrary degree the zeros can be found 
explicitly. 


EXAMPLES 


18. Let $p(z) = z^n - 1$. According to §2.3 the zeros of this polynomial
are the nth roots of unity. Since $a_0 = 1$, the representation in terms of
linear factors is thus given by

$$z^n - 1 = (z - 1)(z - e^{i\varphi})(z - e^{2i\varphi}) \cdots (z - e^{(n-1)i\varphi}),$$

where $\varphi = 2\pi/n$.
19. To determine the representation of the polynomial

$$p(z) = 1 + z + z^2 + \cdots + z^n$$

in terms of linear factors. We find

$$z\,p(z) = z + z^2 + \cdots + z^n + z^{n+1},$$

and upon subtracting p(z) and dividing by z - 1,

$$p(z) = \frac{z^{n+1} - 1}{z - 1} \qquad (z \ne 1).$$

The zeros of p thus are the (n + 1)st roots of unity, with the exception of
z = 1. For instance for n = 3,

$$1 + z + z^2 + z^3 = (z - i)(z + 1)(z + i).$$


The above theory refers to arbitrary polynomials with complex coeffi- 
cients. Real polynomials are contained in this class as a special case. 
We now shall state two theorems that are true only for real poly- 
nomials. 


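Theorem 2.5d If z is a zero of the real polynomial p, then $\bar z$ is a
zero also.


Proof. Let

$$p(z) = a_0 z^n + a_1 z^{n-1} + \cdots + a_n$$

be the given real polynomial. If

$$0 = a_0 z^n + a_1 z^{n-1} + \cdots + a_n,$$

then by repeated applications of the relations (2-17)

$$0 = \bar 0 = \overline{a_0 z^n + a_1 z^{n-1} + \cdots + a_n}
= \overline{a_0 z^n} + \overline{a_1 z^{n-1}} + \cdots + \overline{a_n}
= \bar a_0 \overline{z^n} + \bar a_1 \overline{z^{n-1}} + \cdots + \bar a_n.$$

However, since the $a_k$ are real, $\bar a_k = a_k$ ($k = 0, 1, \ldots, n$); furthermore,
$\overline{z^k} = (\bar z)^k$. We thus have

$$0 = a_0 \bar z^n + a_1 \bar z^{n-1} + \cdots + a_n,$$

i.e., the number $\bar z$ is a zero of p.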

In addition to the statement of theorem 2.5d, it is also true that the
multiplicities of z and $\bar z$ are the same. For a proof see problem 26 in
§2.6. As a consequence, we find that in the factored representation of a
real polynomial each factor $z - z_k$ involving a nonreal $z_k$ can be grouped
with a factor $z - z_m$, where $z_m = \bar z_k$. Multiplying out such a pair of
complex conjugate factors we obtain

$$(z - z_k)(z - \bar z_k) = z^2 - (z_k + \bar z_k) z + z_k \bar z_k,$$

a quadratic polynomial with the real coefficients $-(z_k + \bar z_k) = -2\,\mathrm{Re}\, z_k$
and $z_k \bar z_k = |z_k|^2$. We thus have obtained:


Theorem 2.5e Any real polynomial of degree $n \ge 1$ can be represented
in a unique manner as a product of real linear factors and of
real quadratic factors with complex conjugate zeros.


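EXAMPLES
20. $z^3 + z^2 + z + 1 = (z - i)(z + 1)(z + i) = (z^2 + 1)(z + 1)$.
21. Let

$$p(z) = (z - 1)\left(z - \frac{1 + i\sqrt3}{2}\right)\left(z - \frac{-1 + i\sqrt3}{2}\right)
(z + 1)\left(z - \frac{-1 - i\sqrt3}{2}\right)\left(z - \frac{1 - i\sqrt3}{2}\right).$$

The representation by real factors is

$$p(z) = (z - 1)(z^2 - z + 1)(z^2 + z + 1)(z + 1).$$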
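
The grouping of conjugate pairs behind theorem 2.5e can be tried out numerically. The following is a minimal sketch, assuming a nonreal zero is given as a Python `complex`; the helper name `real_quadratic` is hypothetical.

```python
import math


def real_quadratic(zk):
    """Coefficients (1, -2 Re zk, |zk|^2) of (z - zk)(z - conj(zk))."""
    zk = complex(zk)
    return (1.0, -2.0 * zk.real, abs(zk) ** 2)


# The zero (1 + i*sqrt(3))/2 of example 21 should give roughly z^2 - z + 1.
print(real_quadratic(complex(0.5, math.sqrt(3) / 2)))

# The zero i of example 20 should give z^2 + 1.
print(real_quadratic(1j))
```

Because the square root is computed in floating point, the first result agrees with $z^2 - z + 1$ only up to rounding error.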


Problems 


20. Prove: A real polynomial of odd degree has at least one real zero. 
21. Applying Vieta's formulas to $z^n - 1$, prove that

$$1 + \cos\frac{2\pi}{n} + \cos\left(2\,\frac{2\pi}{n}\right) + \cdots + \cos\left((n-1)\frac{2\pi}{n}\right) = 0,$$

$$\sin\frac{2\pi}{n} + \sin\left(2\,\frac{2\pi}{n}\right) + \cdots + \sin\left((n-1)\frac{2\pi}{n}\right) = 0.$$

22. Find all zeros of the following polynomials: (a) $z^4 + 6z^2 + 25$, (b)
$z^4 - 6z^2 + 25$, (c) $z^8 + 14z^4 + 625$. (Use problem 14.)

23. The real polynomial $p(z) = z^4 + a_1 z^3 + a_2 z^2 + a_3 z + a_4$ is known to
have the zeros 1 + i and -1 - i. Determine the coefficients $a_1, \ldots, a_4$.


2.6 Multiplicity and Derivative


In this section we wish to point out the following connection between
the multiplicity of a zero of a polynomial p and the values of the derivatives
of p at the zero: If the multiplicity of a zero z of a polynomial p is k, then

$$p(z) = p'(z) = \cdots = p^{(k-1)}(z) = 0, \qquad p^{(k)}(z) \ne 0. \tag{2-35}$$

The multiplicity of a zero of a polynomial thus can be determined without
completely factoring the polynomial simply by evaluating the derivatives
of the polynomial at the zero.

For real zeros of real polynomials the relations (2-35) are a
straightforward consequence of elementary differentiation rules of calculus. If
$x_1$ is a real zero of multiplicity k, we need only to observe that by theorem
2.5c we can write

$$p(x) = (x - x_1)^k q(x), \tag{2-36}$$

where q is a polynomial such that $q(x_1) \ne 0$. Differentiating (2-36) by
means of the Leibnitz formula for differentiation of a product (see Kaplan
[1953], p. 19), we get precisely the relations (2-35).

For complex zeros and complex polynomials, however, there arises 
first of all the question of what is meant by the derivatives p’, p”,.... 
(After all, the variables considered in the definition of the derivative in 
calculus are always real.) It is possible to extend the limit definition of 
the derivative to the complex domain in such a manner that the calculus 
proof of (2-35) retains its validity. However, some of the issues involved 
in the theory of complex differentiation are fairly sophisticated, and their 
discussion is best left to a course in complex analysis. Fortunately, as far 
as polynomials are concerned, it is possible to discuss derivatives in a 
purely algebraic manner, without considering limits. 

If p is a polynomial of degree n with real or complex coefficients,

$$p(z) = a_0 z^n + a_1 z^{n-1} + \cdots + a_n, \tag{2-37}$$

we define the derivative of first order of p to be the polynomial of degree
n - 1,

$$p'(z) = n a_0 z^{n-1} + (n - 1) a_1 z^{n-2} + \cdots + a_{n-1}. \tag{2-38}$$

Derivatives of higher orders are defined recursively, for instance

$$p''(z) = (p')'(z) = n(n - 1) a_0 z^{n-2} + (n - 1)(n - 2) a_1 z^{n-3} + \cdots + 2 a_{n-2}.$$
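
Since this definition of the derivative is purely algebraic, it translates directly into an operation on coefficient lists and works equally well for complex coefficients. A minimal Python sketch, assuming coefficients are stored in descending powers and using the hypothetical helper names `derivative` and `kth_derivative`:

```python
def derivative(coeffs):
    """Coefficients of p' for p = [a0, a1, ..., an] (descending powers)."""
    n = len(coeffs) - 1
    # The term a_k z^(n-k) becomes (n-k) a_k z^(n-k-1); the constant term drops out.
    return [(n - k) * a for k, a in enumerate(coeffs[:-1])]


def kth_derivative(coeffs, k):
    """Apply the first-order rule k times (derivatives are defined recursively)."""
    for _ in range(k):
        coeffs = derivative(coeffs)
    return coeffs


p = [7, 5, -2, 0, 8]            # 7z^4 + 5z^3 - 2z^2 + 8
print(derivative(p))            # [28, 15, -4, 0]
print(kth_derivative(p, 4))     # [168], i.e. n! * a0 with n = 4
```

In this sketch the derivative of a constant polynomial is returned as the empty list, which stands for the zero polynomial.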
EXAMPLES†


22. Using binomial coefficients, the kth derivative of the polynomial
(2-37) can be expressed as follows:

$$p^{(k)}(z) = k!\left[\binom{n}{k} a_0 z^{n-k} + \binom{n-1}{k} a_1 z^{n-k-1} + \cdots + \binom{k}{k} a_{n-k}\right].$$

† Readers unfamiliar with binomial coefficients should at this point consult §3.5 for
the definition and basic properties of these coefficients.

23. If p denotes the polynomial (2-37), then clearly

$$p^{(n)}(z) = n!\,a_0.$$

24. We wish to calculate the kth derivative of the special polynomial

$$p(z) = (z - a)^n = z^n - \binom{n}{1} a z^{n-1} + \binom{n}{2} a^2 z^{n-2} - \cdots + (-1)^n \binom{n}{n} a^n.$$

By example 22, we have

$$p^{(k)}(z) = k!\left[\binom{n}{k} z^{n-k} - \binom{n-1}{k}\binom{n}{1} a z^{n-k-1} + \cdots
+ (-1)^{n-k}\binom{k}{k}\binom{n}{n-k} a^{n-k}\right].$$

The products of binomial coefficients occurring here can be recombined
as follows:

$$\binom{n-m}{k}\binom{n}{m} = \frac{(n-m)!}{k!\,(n-m-k)!}\,\frac{n!}{m!\,(n-m)!}
= \frac{n!}{k!\,(n-k)!}\,\frac{(n-k)!}{m!\,(n-k-m)!}
= \binom{n}{k}\binom{n-k}{m}.$$

Thus we have

$$p^{(k)}(z) = k!\binom{n}{k}\left[z^{n-k} - \binom{n-k}{1} a z^{n-k-1} + \cdots
+ (-1)^{n-k}\binom{n-k}{n-k} a^{n-k}\right].$$

By the binomial theorem, the expression in brackets reduces to $(z - a)^{n-k}$.
We thus find

$$p^{(k)}(z) = k!\binom{n}{k}(z - a)^{n-k}.$$

This result is familiar, of course, when z and a are real.


It is clear that the derivative defined by equation (2-38) satisfies some of
the laws that we expect from real calculus. Thus, if c is an arbitrary
constant, we have

$$(cp)' = c p';$$

furthermore, if p and q are any two polynomials, then

$$(p + q)' = p' + q'.$$

It is somewhat less obvious that the familiar rule for the differentiation of
a product also holds in the complex case:

$$(pq)' = p'q + pq'. \tag{2-39}$$

In order to prove (2-39), let p be given by (2-37) and q by

$$q(z) = b_0 z^m + b_1 z^{m-1} + \cdots + b_m.$$

We agree to set $a_k = 0$ for k > n and $b_k = 0$ for k > m. The product of
the two polynomials then can be written

$$p(z)q(z) = c_0 z^{n+m} + c_1 z^{n+m-1} + \cdots + c_{n+m},$$

where

$$c_k = a_0 b_k + a_1 b_{k-1} + \cdots + a_{k-1} b_1 + a_k b_0,$$

$k = 0, 1, 2, \ldots, m + n$. The coefficient of $z^{n+m-1-k}$ in the derivative of
pq then is $(m + n - k)c_k$. If, on the other hand, we form the expression
p'q + pq' directly, we find

$$p'(z)q(z) + p(z)q'(z)
= (n a_0 z^{n-1} + (n-1) a_1 z^{n-2} + \cdots + a_{n-1})(b_0 z^m + b_1 z^{m-1} + \cdots + b_m)
+ (a_0 z^n + a_1 z^{n-1} + \cdots + a_n)(m b_0 z^{m-1} + (m-1) b_1 z^{m-2} + \cdots + b_{m-1}).$$

Collecting the coefficients of all terms combining to $z^{m+n-1-k}$, we find

$$n a_0 b_k + (n - 1) a_1 b_{k-1} + \cdots + (n - k) a_k b_0
+ a_0 (m - k) b_k + a_1 (m - k + 1) b_{k-1} + \cdots + a_k m b_0
= (n + m - k)(a_0 b_k + a_1 b_{k-1} + \cdots + a_k b_0).$$

Since this agrees with $(m + n - k)c_k$, our assertion is proved.
As in real calculus, we now obtain by induction from (2-39) the Leibnitz
formula for the kth derivative of a product of two polynomials:

$$(pq)^{(k)} = \binom{k}{0} p^{(k)} q + \binom{k}{1} p^{(k-1)} q' + \cdots + \binom{k}{k} p\,q^{(k)}. \tag{2-40}$$

We now are ready to prove:


Theorem 2.6 Let p be a polynomial of positive degree, and let
$z = z_1$ be a zero of multiplicity m of p. Then

$$p(z_1) = p'(z_1) = \cdots = p^{(m-1)}(z_1) = 0, \qquad p^{(m)}(z_1) \ne 0.$$

Proof. By theorem 2.5c, the polynomial p can be written in the form

$$p(z) = (z - z_1)^m q(z), \tag{2-41}$$

where q is a polynomial such that $q(z_1) \ne 0$. We now form the kth
derivative of (2-41) by the Leibnitz formula (2-40). By example 24 we find,
for $k \le m$,

$$p^{(k)}(z) = \binom{k}{0} k! \binom{m}{k} (z - z_1)^{m-k} q(z)
+ \binom{k}{1} (k-1)! \binom{m}{k-1} (z - z_1)^{m-k+1} q'(z)
+ \cdots
+ \binom{k}{k} (z - z_1)^m q^{(k)}(z).$$

Here we set $z = z_1$ and distinguish two cases. If k < m, each term
contains a positive power of $z - z_1$ and thus vanishes for $z = z_1$. If k = m,
all terms except the first contain positive powers of $z - z_1$. The first
term reduces to a nonzero constant times q(z), which is $\ne 0$ for $z = z_1$.
Thus theorem 2.6 is proved.
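
Theorem 2.6 suggests a simple procedure for determining a multiplicity: differentiate until the value at the zero no longer vanishes. The Python sketch below follows that idea for polynomials with integer (or otherwise exactly representable) coefficients, where the comparison with zero is exact; with floating-point data a tolerance would be needed. The names `evaluate`, `derivative`, and `multiplicity` are assumptions made for the illustration.

```python
def evaluate(coeffs, z):
    """Value at z of the polynomial with coefficients in descending powers."""
    acc = 0
    for a in coeffs:
        acc = acc * z + a
    return acc


def derivative(coeffs):
    """Algebraic derivative according to definition (2-38)."""
    n = len(coeffs) - 1
    return [(n - k) * a for k, a in enumerate(coeffs[:-1])]


def multiplicity(coeffs, z):
    """Number of successive derivatives p, p', p'', ... that vanish at z.

    Assumes p has positive degree, so some derivative is a nonzero constant.
    """
    k = 0
    while coeffs and evaluate(coeffs, z) == 0:
        coeffs = derivative(coeffs)
        k += 1
    return k


q = [1, -1, -1, 1]             # q(z) = (z - 1)^2 (z + 1) from example 17
print(multiplicity(q, 1))      # 2
print(multiplicity(q, -1))     # 1
print(multiplicity(q, 3))      # 0, since 3 is not a zero of q
```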


Problems 


24. The polynomial

$$p(z) = z^5 - z^4 - 8z^3 + 20z^2 - 17z + 5$$

has the zero z = 1. Determine its multiplicity.
25. What are the multiplicities of the zeros at $z = \pm i$ of the polynomial

$$p(z) = z^5 + z^4 + 2z^3 + 2z^2 + z + 1?$$

26. Prove: Complex conjugate zeros of real polynomials have equal
multiplicities. (Hint: The derivatives of a real polynomial are real
polynomials. Now use the theorems 2.5d and 2.6.)

27. Using the definition (2-38) of the derivative, prove Taylor's theorem for
arbitrary polynomials. That is, show that if p is a polynomial of degree n,
then for arbitrary z and h

$$p(z + h) = p(z) + \frac{h}{1!} p'(z) + \frac{h^2}{2!} p''(z) + \cdots + \frac{h^n}{n!} p^{(n)}(z).$$

(Use the representation of example 22 for the kth derivative of p.)


Recommended Reading 
A more thorough treatment of complex numbers and polynomials 
will be found in Birkhoff and MacLane [1953]. 


chapter 3 difference equations 


Many algorithms of numerical analysis consist in determining solutions of 
difference equations. In addition, difference equations play an important 
part in other branches of pure and applied mathematics such as com- 
binatorial analysis, the theory of probability, and mathematical economics. 

Many aspects of the theory of difference equations are similar to certain 
aspects of the theory of differential equations. We thus begin this chapter 
with a brief review of the idea of a differential equation. 


3.1. Differential Equations 


Suppose f = f(x, y, z) is a real valued function defined for all x in an
interval I = [a, b], and for all y and z lying in certain sets of real numbers
$S_0$ and $S_1$. The sets $S_0$ and $S_1$ may depend on x. The following problem
is called a differential equation of the first order: To find a function
y = y(x), differentiable for $x \in I$, such that, for all x in I,

(i) $y(x) \in S_0$, $y'(x) \in S_1$;
(ii) $f(x, y(x), y'(x)) = 0$.

The essential condition here is (ii); condition (i) is merely imposed in order
that condition (ii) makes sense. The problem thus defined is symbolically
denoted by

$$f(x, y, y') = 0; \tag{3-1}$$

any function y = y(x) satisfying the conditions (i) and (ii) is called a
solution of the differential equation (3-1).


EXAMPLES
1. Let I, $S_0$, $S_1$ all be the sets of all real numbers, and let

$$f(x, y, z) = z - ky,$$

where k is a constant. Every function

$$y(x) = C e^{kx},$$

where C is an arbitrary constant, is a solution of the resulting differential
equation

$$y' = ky.$$

2. Let I be the set of all reals, and let $S_0 = S_1 = [-1, 1]$. If

$$f(x, y, z) = z^2 - (1 - y^2)$$

there results the differential equation

$$y'^2 = 1 - y^2,$$

solutions of which are given by the functions

$$y(x) = \sin(x - a),$$

where a is again arbitrary.


More generally, we consider differential equations of order N, where N
is a positive integer. Let $f = f(x, y_0, y_1, \ldots, y_N)$ be a real valued function
defined for $x \in I$ and for all $y_0, y_1, \ldots, y_N$ in certain sets of real numbers
$S_0, S_1, \ldots, S_N$. The problem of finding a function y = y(x) defined for
$x \in I$, having N derivatives on I, and satisfying the conditions

(i) $y^{(k)}(x) \in S_k$, $k = 0, 1, \ldots, N$;

(ii) $f(x, y(x), y'(x), \ldots, y^{(N)}(x)) = 0$

for all x in I, is called a differential equation of order N, and is symbolically
denoted by

$$f(x, y, y', \ldots, y^{(N)}) = 0.$$

EXAMPLE
3. Let N = 2, and let I, $S_0$, $S_1$, and $S_2$ be the set of all real numbers. If

$$f(x, y_0, y_1, y_2) = y_0 + y_2,$$

every function of the form

$$y(x) = A \cos x + B \sin x,$$

where A and B are constants, is a solution of the resulting differential
equation $y'' + y = 0$.


The problems studied in the theory of differential equations not only 
concern the analytic representation of some or all of their solutions, but 
also the general behavior of these solutions, especially when x tends to 
infinity. 
3.2 Difference Equations


The fundamental problem in the theory of difference equations is in 
many ways similar to that of the theory of differential equations, with the 
exception that the mathematical object sought here is a sequence rather 
than a function. From the abstract point of view, a sequence is merely a 
function defined on a set of integers. More concretely, a sequence s is
defined by associating with every member of a set I of integers a certain
(real) number. Traditionally, the number associated with the integer n
is denoted by a symbol such as $s_n$ and not by the usual functional notation
s(n).

In almost all applications, the domain of definition of a sequence is a
set of consecutive integers, i.e., the set of all integers contained between
two fixed limits, one or both of which may be infinite. Frequently I is the
set of all nonnegative integers.

Sequences can be denoted by a single letter such as s, much as functions
are denoted by single letters such as f. More commonly, however, the
sequence with the general element $s_n$ is written either as $\{s_0, s_1, s_2, \ldots\}$ or
simply as $\{s_n\}$.

A sequence can be defined by giving an explicit formula for $s_n$, such as


l n = 27. 


migasie A 


More often than not, however, the elements of a sequence are defined only 
implicitly. Difference equations are among the most important tools for 
defining sequences. 

We first explain what is meant by a difference equation of order 1. Let
$f = \{f_n(y, z)\}$ be a sequence of functions defined on a set I of consecutive
integers and for all y and z belonging to some set of real numbers S. The
problem is to find a sequence $\{x_n\}$ of real numbers defined on a set
containing I so that the following conditions are satisfied for all $n \in I$:

(i) $x_n \in S$, $x_{n-1} \in S$;

(ii) $f_n(x_n, x_{n-1}) = 0$.

A sequence $\{x_n\}$ satisfying these conditions is called a solution of the
difference equation symbolized by equation (ii).


EXAMPLES 


4. Let I be the set of all integers. The choice $f_n(y, z) = y - z - 1$
yields the difference equation

$$x_n - x_{n-1} = 1.$$

Obvious solutions are given by $x_n = n + c$ for every constant c.
5. Let I be the set of nonnegative integers, $f_n(y, z) = y - z - n$. There
results the difference equation

$$x_n - x_{n-1} = n,$$

having (among others) the solution

$$x_n = \frac{n(n + 1)}{2}.$$

6. Let I be the set of all integers, and let $f_n(y, z) = y - qz$ where q is a
nonzero constant. We obtain the difference equation

$$x_n = q x_{n-1}.$$

A solution satisfying $x_0 = 1$ is given by $x_n = q^n$.

The difference equation of order N, where N is a positive integer, is
defined in a similar manner. Let $f = \{f_n(y_0, y_1, \ldots, y_N)\}$ be a sequence,
defined on a set I of consecutive integers, whose elements are functions
defined for $y_k$ ($k = 0, 1, \ldots, N$) in a set of real numbers S. The problem
is to find a sequence $\{x_n\}$, defined on a set containing I and satisfying the
following conditions for all $n \in I$:

(i) $x_n \in S$, $x_{n-1} \in S$, $\ldots$, $x_{n-N} \in S$;

(ii) $f_n(x_n, x_{n-1}, \ldots, x_{n-N}) = 0$.

A sequence $\{x_n\}$ with these properties is again called a solution of the
difference equation symbolized by (ii).


EXAMPLES 


7. Let I be the set of all integers, $f_n(y_0, y_1, y_2) = y_0 - 2\cos\varphi\, y_1 + y_2$,
where $\varphi$ is real. A solution of the resulting difference equation

$$x_n - 2\cos\varphi\, x_{n-1} + x_{n-2} = 0$$

is given by $x_n = \cos(n\varphi)$.
8. Let I be the set of nonnegative integers, $f_n(y_0, y_1, y_2) = y_0 - y_1 - y_2$.
Can you find a solution of the difference equation

$$x_n - x_{n-1} - x_{n-2} = 0?$$


As in the case of differential equations, the problems studied in the 
theory of difference equations not only concern the analytic representation 
of some or all solutions, but also the general behavior of these solutions, 
especially when n tends to infinity. 
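
Whether a proposed sequence solves a given difference equation can be checked by substituting it into condition (ii). The Python sketch below does so for examples 5 and 7; the helper `residuals`, the test angle `phi`, and the ranges of n are assumptions chosen for the illustration.

```python
import math


def residuals(f, x, ns, order):
    """Evaluate f_n(x_n, x_{n-1}, ..., x_{n-order}) for each n in ns."""
    return [f(n, *(x(n - j) for j in range(order + 1))) for n in ns]


# Example 5: f_n(y, z) = y - z - n with the candidate x_n = n(n + 1)/2.
r5 = residuals(lambda n, y, z: y - z - n,
               lambda n: n * (n + 1) // 2,
               range(1, 10), order=1)

# Example 7: f_n(y0, y1, y2) = y0 - 2 cos(phi) y1 + y2 with x_n = cos(n phi).
phi = 0.3  # arbitrary test angle
r7 = residuals(lambda n, y0, y1, y2: y0 - 2 * math.cos(phi) * y1 + y2,
               lambda n: math.cos(n * phi),
               range(-5, 6), order=2)

print(r5)                         # exact integer residuals, all zero
print(max(abs(r) for r in r7))    # zero up to rounding error
```

A vanishing residual for finitely many n is of course only a plausibility check, not a proof that the sequence is a solution.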

A differential equation usually has many solutions. In order to pin
down a solution, we have to specify some additional property of it, such
as its value, and perhaps the values of some of its derivatives, at a given
point. The problem of finding a solution of the differential equation
under these side conditions is called an initial value problem. Similarly,
in order to pin down the solution of a difference equation, we have to
prescribe the values of some elements of the solution $\{x_n\}$. For instance,
in example 8 above, we may require that $x_0 = x_1 = 1$.

The solution of an initial value problem for a difference equation is, in a
way, a trivial matter. We assume that the equation

$$f_n(y_0, y_1, \ldots, y_N) = 0$$

can be solved for $y_0$,

$$y_0 = g_n(y_1, y_2, \ldots, y_N).$$

The difference equation can then be written as a recurrence relation for
the elements of the sequence $\{x_n\}$,

$$x_n = g_n(x_{n-1}, x_{n-2}, \ldots, x_{n-N}). \tag{3-2}$$

Once N consecutive elements of the solution are known, further elements
can be obtained by successively evaluating the functions $g_n$. In this
manner we find for the difference equation of example 8 the solution

$$\{1, 1, 2, 3, 5, 8, 13, \ldots\}.$$

There remains the problem, however, of finding an explicit formula for $x_n$,
and of determining all solutions of a given difference equation.
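
The step-by-step evaluation of a recurrence relation of the form (3-2) is easily mechanized. The following Python sketch starts from N prescribed initial elements and appends further elements one at a time; the function name `iterate` and its argument conventions are assumptions made for the illustration.

```python
def iterate(g, initial, count):
    """Return x_0, ..., x_{count-1} for x_n = g(n, x_{n-1}, ..., x_{n-N}).

    `initial` holds the N >= 1 prescribed starting elements x_0, ..., x_{N-1}.
    """
    xs = list(initial)
    N = len(xs)
    while len(xs) < count:
        n = len(xs)
        # Pass the N most recent elements, newest first: x_{n-1}, ..., x_{n-N}.
        xs.append(g(n, *reversed(xs[-N:])))
    return xs[:count]


# Example 8 with x_0 = x_1 = 1: x_n = x_{n-1} + x_{n-2}.
print(iterate(lambda n, y1, y2: y1 + y2, [1, 1], 8))   # [1, 1, 2, 3, 5, 8, 13, 21]

# Example 5 with x_0 = 0: x_n = x_{n-1} + n.
print(iterate(lambda n, y1: y1 + n, [0], 6))           # [0, 1, 3, 6, 10, 15]
```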


3.3 Linear Difference Equations of Order One 


A difference equation is called linear if, for each $n \in I$, the function $f_n$
is a linear function of $y_0, y_1, \ldots, y_N$. The coefficients in that linear
function may, however, depend on n. That is, for certain sequences $\{a_{0,n}\}$,
$\{a_{1,n}\}, \ldots, \{a_{N,n}\}$, and $\{b_n\}$ we have

$$f_n(y_0, y_1, \ldots, y_N) = a_{0,n} y_0 + a_{1,n} y_1 + \cdots + a_{N,n} y_N + b_n. \tag{3-3}$$
EXAMPLES 
9. The difference equations considered in the examples 4 through 8 are 
linear. 
10. The difference equation 

Kah DAN 4 PIM = 2 
is linear. 
11. The difference equation 

Xn 2x2 25 =, 0 

is not linear. 
A linear difference equation is called homogeneous if $f_n(0, 0, \ldots, 0) = 0$
for all $n \in I$, i.e., if the sequence $\{b_n\}$ has zero elements only. The linear
difference equations considered in the examples 6, 7, and 8 are
homogeneous. The equation of example 10 is not homogeneous. Likewise,
the equation of example 11 is not homogeneous, because it is not linear.

Postponing the discussion of linear difference equations of order
N > 1 until chapter 6, we shall in the present section consider linear
difference equations of order 1 only. Assuming that $a_{0,n} \ne 0$, $n \in I$, we
may divide through by $a_{0,n}$ and write any such difference equation in the
form (3-2), viz.,

$$x_n = a_n x_{n-1} + b_n. \tag{3-4}$$

We shall assume that I is the set of positive integers and shall try to find a
closed expression for the solution of (3-4) satisfying an arbitrary initial
condition $x_0 = c$.

We first consider the homogeneous equation

$$x_n = a_n x_{n-1}. \tag{3-5}$$

It is seen immediately that the solution satisfying $x_0 = c$ is given by

$$x_n = c\,\pi_n, \tag{3-6}$$

where

$$\pi_0 = 1, \qquad \pi_n = a_1 a_2 \cdots a_n, \quad n = 1, 2, \ldots. \tag{3-7}$$

To find the solution of the nonhomogeneous equation (3-4) satisfying
$x_0 = c$, we use the method of variation of constants familiar from the
theory of ordinary differential equations. Thus we set

$$x_n = c_n \pi_n, \tag{3-8}$$

where the sequence $\{c_n\}$ is to be determined. In view of $x_0 = c_0 \pi_0 = c_0$
the initial condition yields $c_0 = c$. Substituting (3-8) into (3-4) we find

$$c_n \pi_n = a_n c_{n-1} \pi_{n-1} + b_n = c_{n-1} \pi_n + b_n.$$

We assume that $a_n \ne 0$, $n = 1, 2, \ldots$, which implies that $\pi_n \ne 0$. The
last relation then yields

$$c_n = c_{n-1} + \frac{b_n}{\pi_n},$$

and it follows that

$$c_n = c_0 + (c_1 - c_0) + (c_2 - c_1) + \cdots + (c_n - c_{n-1})
= c + \frac{b_1}{\pi_1} + \frac{b_2}{\pi_2} + \cdots + \frac{b_n}{\pi_n}.$$
For later reference we summarize the above as follows:


Theorem 3.3 Let $a_n \ne 0$, $n = 1, 2, \ldots$. The solution of the
difference equation

$$x_n = a_n x_{n-1} + b_n$$

satisfying $x_0 = c$ is given by

$$x_n = \pi_n\left(c + \frac{b_1}{\pi_1} + \frac{b_2}{\pi_2} + \cdots + \frac{b_n}{\pi_n}\right), \quad n = 0, 1, 2, \ldots, \tag{3-9}$$

where $\pi_0 = 1$, $\pi_n = a_1 a_2 \cdots a_n$, $n = 1, 2, \ldots$.


We observe that this solution can be considered as being composed of
the sequence $\{c\,\pi_n\}$, which is the solution of the homogeneous equation
having initial value c, and of the sequence

$$\left\{\pi_n\left(\frac{b_1}{\pi_1} + \frac{b_2}{\pi_2} + \cdots + \frac{b_n}{\pi_n}\right)\right\},$$

which is the solution of the nonhomogeneous equation having initial value
zero. The principle of superposition familiar from the theory of
differential equations thus prevails also for difference equations.

Whether or not the solution (3-9) can be expressed in simple form
depends on the possibility of expressing the products $\pi_n$ and the sums $c_n$ in
simple form, much as in the case of differential equations the simplicity
of the solution depends on the possibility of evaluating certain integrals.
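
Formula (3-9) can be compared directly with the values produced by the recurrence (3-4). The Python sketch below does this in exact rational arithmetic; the function names and the particular test sequences $a_n = n + 1$, $b_n = n^2 - 3$, $c = 2$ are assumptions chosen only for the illustration.

```python
from fractions import Fraction


def by_recurrence(a, b, c, n):
    """x_n obtained by applying x_k = a_k x_{k-1} + b_k for k = 1, ..., n."""
    x = c
    for k in range(1, n + 1):
        x = a(k) * x + b(k)
    return x


def by_formula(a, b, c, n):
    """x_n from (3-9): pi_n * (c + b_1/pi_1 + ... + b_n/pi_n)."""
    pi = Fraction(1)         # pi_0 = 1
    total = Fraction(c)      # running value of c + b_1/pi_1 + ... + b_k/pi_k
    for k in range(1, n + 1):
        pi *= a(k)           # pi_k = a_1 a_2 ... a_k, nonzero by assumption
        total += Fraction(b(k)) / pi
    return pi * total


a = lambda n: Fraction(n + 1)       # a_n != 0 for all n >= 1
b = lambda n: Fraction(n * n - 3)
c = Fraction(2)

print(all(by_recurrence(a, b, c, n) == by_formula(a, b, c, n) for n in range(8)))  # True
```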


Problems 
1. Let -1 < a < 1. Solve the difference equation

$$x_0 = 1, \qquad x_n = a x_{n-1} + 1$$

and determine the limit of the sequence $\{x_n\}$ as $n \to \infty$.
2. Let z be an arbitrary real number, $z \ne 0$. Solve the difference equation

$$x_0 = 1, \qquad x_n = \frac{n}{z}\, x_{n-1} + 1.$$

Show that

$$\frac{z^n}{n!}\, x_n \to e^z, \qquad n \to \infty.$$


3. Let the infinite series

$$\sum_{n=0}^{\infty} b_n$$

be convergent, and let its sum be s. Show that

$$s = \lim_{n \to \infty} x_n,$$

where $\{x_n\}$ denotes the solution of the difference equation

$$x_0 = b_0, \qquad x_n = x_{n-1} + b_n.$$

4. It is shown in calculus that

$$\prod_{n=1}^{\infty}\left(1 - \frac{z^2}{n^2}\right) = \frac{\sin \pi z}{\pi z}.$$

Show that the infinite product can be obtained as $\lim_{n \to \infty} x_n$, where $\{x_n\}$
satisfies the linear and homogeneous difference equation

$$x_0 = 1, \qquad x_n = \left(1 - \frac{z^2}{n^2}\right) x_{n-1}, \quad n = 1, 2, \ldots.$$


3.4 Horner’s Scheme 


Several applications of theorem 3.3 will be made in later chapters. At 
this point we wish to show how difference equations can be used to cal- 
culate the values of a polynomial. 

Let $b_0, b_1, \ldots, b_N$ be N + 1 given constants, and let z be a given real
number. We define


Algorithm 3.4 Calculate the numbers $x_0, x_1, \ldots, x_N$ recursively by
the relations

$$x_0 = b_0, \qquad x_n = z x_{n-1} + b_n, \quad n = 1, 2, \ldots, N. \tag{3-10}$$

Evidently, the finite sequence $\{x_n\}$ defined in algorithm 3.4 is the solution
satisfying $x_0 = b_0$ of a difference equation (3-4), where $a_n = z$,
$n = 1, 2, \ldots, N$. In this special case, $\pi_n = z^n$ and hence, by virtue of theorem
3.3,

$$x_n = b_0 z^n + b_1 z^{n-1} + \cdots + b_n,$$

and in particular for n = N,

$$x_N = b_0 z^N + b_1 z^{N-1} + \cdots + b_N.$$

We thus have obtained:


Theorem 3.4 The numbers $x_n$ produced by algorithm 3.4 equal the
values of the polynomials $p_n$ defined by

$$p_n(x) = b_0 x^n + b_1 x^{n-1} + \cdots + b_n \qquad (n = 0, 1, \ldots, N)$$

at x = z.


Algorithm 3.4 may thus be regarded as a method for evaluating a
polynomial. This method is generally known as Horner's rule. It
requires n multiplications and n additions for the evaluation of a
polynomial of degree n, whereas the "straightforward" method (building up
the powers of x recursively by $x^n = x \cdot x^{n-1}$ and subsequently multiplying
by the $b_n$) normally requires 2n - 1 multiplications and n additions.
Schematically, Horner's method may be indicated thus:

$$\begin{array}{ccccccccc}
b_0 & & b_1 & & b_2 & & \cdots & & b_N \\
\downarrow & & \downarrow & & \downarrow & & & & \downarrow \\
x_0 & \Rightarrow & x_1 & \Rightarrow & x_2 & \Rightarrow & \cdots & \Rightarrow & x_N
\end{array}$$

(The symbol $\Rightarrow$ indicates multiplication by z.)

Scheme 3.4
EXAMPLE 


12. To evaluate the polynomial

$$p(x) = 7x^4 + 5x^3 - 2x^2 + 8$$

at x = 0.5. Scheme 3.4 yields (note that the coefficient $b_3 = 0$ may not
be suppressed!)

    7    5      -2      0       8
    7    8.5    2.25    1.125   8.5625

It follows that p(0.5) = 8.5625.
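
Algorithm 3.4 can be written out in a few lines of code. The Python sketch below returns the whole sequence $x_0, \ldots, x_N$ rather than only the final value, so that its output can be compared with the second row of the scheme in example 12; the function name `horner_scheme` is an assumption made for the illustration.

```python
def horner_scheme(b, z):
    """Return [x_0, ..., x_N] with x_0 = b_0 and x_n = z*x_{n-1} + b_n."""
    xs = [b[0]]
    for bn in b[1:]:
        xs.append(z * xs[-1] + bn)   # one multiplication and one addition
    return xs


# Example 12: p(x) = 7x^4 + 5x^3 - 2x^2 + 8 at x = 0.5 (note the explicit 0).
xs = horner_scheme([7, 5, -2, 0, 8], 0.5)
print(xs)        # [7, 8.5, 2.25, 1.125, 8.5625]
print(xs[-1])    # 8.5625, the value p(0.5)
```

All intermediate values in this example are exactly representable in binary floating point, so the printed numbers match the hand computation exactly.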


3.5 Binomial Coefficients 


We wish to obtain a generalization of algorithm 3.4 and recall to this
end some facts about binomial coefficients.
The binomial coefficients

$$\binom{n}{m} \qquad \text{(to be read: "n over m")}$$

are for integral n and m, $n \ge 0$, $0 \le m \le n$, defined by the expansion

$$(1 + x)^n = \binom{n}{0} + \binom{n}{1} x + \binom{n}{2} x^2 + \cdots + \binom{n}{n} x^n.$$

It is well known that

$$\binom{n}{m} = \frac{n(n - 1)(n - 2) \cdots (n - m + 1)}{1 \cdot 2 \cdot 3 \cdots m} = \frac{n!}{m!\,(n - m)!}. \tag{3-11}$$

From (3-11) it follows immediately that

$$\binom{n}{m} = \binom{n}{n - m}. \tag{3-12}$$

The identity

$$\binom{n}{m - 1} + \binom{n}{m} = \binom{n + 1}{m} \tag{3-13}$$

likewise is easily proved. It states that if the binomial coefficients are
arranged in a triangular array known as Pascal's triangle,

                1
              1   1
            1   2   1
          1   3   3   1
        1   4   6   4   1

then each entry in this array is equal to the sum of the two entries
immediately above it. We also shall require the following slightly less
obvious property of the binomial coefficients.


Theorem 3.5 For any two nonnegative integers k and $n \ge k$ the
following identity holds:

$$\binom{k}{k} + \binom{k+1}{k} + \binom{k+2}{k} + \cdots + \binom{n}{k} = \binom{n+1}{k+1}.$$

Proof. Keeping k fixed, we use induction with respect to n. Set

$$S_n = \binom{k}{k} + \binom{k+1}{k} + \cdots + \binom{n}{k}.$$

Assuming that for some integer $n - 1 \ge k$

$$S_{n-1} = \binom{n}{k+1}, \tag{3-14}$$

which is certainly true for n - 1 = k, we have, using (3-13),

$$S_n = S_{n-1} + \binom{n}{k} = \binom{n}{k+1} + \binom{n}{k} = \binom{n+1}{k+1},$$

proving (3-14) with n increased by one, which suffices to establish the
formula for all integers $n \ge k$.


EXAMPLE
13. For k = 2, n = 5 we get

$$\binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{5}{2} = 1 + 3 + 6 + 10 = 20 = \binom{6}{3}.$$
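
The identity of theorem 3.5 is easy to test for small values of k and n. The following Python sketch uses `math.comb` from the standard library (available from Python 3.8 on); the helper name `column_sum` and the tested ranges are assumptions made for the illustration.

```python
from math import comb


def column_sum(k, n):
    """Left-hand side of theorem 3.5: C(k,k) + C(k+1,k) + ... + C(n,k)."""
    return sum(comb(m, k) for m in range(k, n + 1))


print(column_sum(2, 5), comb(6, 3))   # 20 20, as in example 13

# Check the identity for all 0 <= k <= n < 12.
print(all(column_sum(k, n) == comb(n + 1, k + 1)
          for k in range(12) for n in range(k, 12)))   # True
```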


Problems 


5. Use the definition of the binomial coefficients to prove that

$$\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n.$$

6. Obtain an alternate proof of theorem 3.5 by differentiating the identity

$$(x - 1)(1 + x + x^2 + \cdots + x^n) = x^{n+1} - 1$$

k + 1 times and setting x = 1.


7. Show by means of theorem 3.3 that the difference initial value problem

$$y_0 = 1, \qquad y_n = n y_{n-1} + n!\binom{n+k}{n}$$

has the solution

$$y_n = n!\left[\binom{k}{0} + \binom{k+1}{1} + \cdots + \binom{k+n}{n}\right], \quad n = 0, 1, 2, \ldots.$$

Show that the same initial value problem is also solved by

$$y_n = n!\binom{k+n+1}{n}$$

and hence obtain yet another proof of theorem 3.5. (Use (3-12).)


3.6 Evaluation of the Derivatives of a Polynomial 


One is frequently required not only to find the value of a polynomial,
but also the values of one or several derivatives at a given point. For
instance, if the polynomial

$$p(x) = b_0 x^N + b_1 x^{N-1} + \cdots + b_N$$

is to be expanded in powers of a new variable h = x - z, then we have
by Taylor's theorem (see Taylor [1959], p. 471)

$$p(x) = p(z + h) = c_0 h^N + c_1 h^{N-1} + \cdots + c_N,$$

where

$$c_{N-k} = \frac{1}{k!}\, p^{(k)}(z), \quad k = 0, 1, \ldots, N. \tag{3-15}$$

These coefficients could be evaluated, of course, by differentiating p
analytically and evaluating the resulting polynomials by Horner's
algorithm 3.4. It turns out, however, that there is a far shorter way.
To this end we consider the following extension of algorithm 3.4.


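Algorithm 3.6 Let the sequence generated in algorithm 3.4 now be
denoted by $\{x_n^{(0)}\}$. For $k = 1, 2, \ldots, N$, generate the sequences
$\{x_n^{(k)}\}$ recursively by the difference equation

$$x_0^{(k)} = x_0^{(k-1)}, \qquad x_n^{(k)} = z x_{n-1}^{(k)} + x_n^{(k-1)}, \quad n = 1, 2, \ldots, N - k. \tag{3-16}$$

Each of the sequences $\{x_n^{(k)}\}$ is generated from the preceding sequence
$\{x_n^{(k-1)}\}$ just as the sequence $\{x_n^{(0)}\}$ was generated from the sequence
$\{b_n\}$. However, the new sequence always terminates one step before the
old sequence. Scheme 3.6 indicates the resulting two-dimensional array
for N = 4.

$$\begin{array}{ccccccccc}
x_0^{(0)} & \Rightarrow & x_1^{(0)} & \Rightarrow & x_2^{(0)} & \Rightarrow & x_3^{(0)} & \Rightarrow & \underline{x_4^{(0)}} \\
x_0^{(1)} & \Rightarrow & x_1^{(1)} & \Rightarrow & x_2^{(1)} & \Rightarrow & \underline{x_3^{(1)}} & & \\
x_0^{(2)} & \Rightarrow & x_1^{(2)} & \Rightarrow & \underline{x_2^{(2)}} & & & & \\
x_0^{(3)} & \Rightarrow & \underline{x_1^{(3)}} & & & & & & \\
\underline{x_0^{(4)}} & & & & & & & &
\end{array}$$

(The symbol $\Rightarrow$ indicates multiplication by z.)
Scheme 3.6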


The relevance of this scheme to the problem stated above is based on the 
following fact: 


Theorem 3.6 If $p_n$ denotes the polynomial

$$p_n(x) = b_0 x^n + b_1 x^{n-1} + \cdots + b_n \qquad (n = 0, 1, \ldots, N),$$

then the numbers $x_n^{(k)}$ determined in algorithm 3.6 satisfy

(3-17)  $x_n^{(k)} = \frac{1}{k!}\, p_{n+k}^{(k)}(z), \qquad n = 0, 1, \ldots, N - k; \quad k = 0, 1, \ldots, N.$


Thus, the coefficients $c_{N-k}$ required in (3-15) are just the entries in the
lower diagonal (underlined in scheme 3.6) in the array of numbers
generated by algorithm 3.6.


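Proof. Theorem 3.4 asserts the truth of formula (3-17) for $k = 0$. We
use induction with respect to $k$ to prove it for arbitrary $k$. Assuming
that (3-17) is true for a certain $k \ge 0$, we can write the difference equation
(3-16) defining the sequence $\{x_n^{(k+1)}\}$ as follows:

$$x_0^{(k+1)} = \frac{1}{k!}\, p_k^{(k)}(z); \qquad x_n^{(k+1)} = z x_{n-1}^{(k+1)} + \frac{1}{k!}\, p_{n+k}^{(k)}(z).$$

Expressing the solution of this equation by means of formula (3-9), where

$$a_n = z, \qquad \pi_n = z^n, \qquad b_n = \frac{1}{k!}\, p_{n+k}^{(k)}(z),$$

we obtain

$$x_n^{(k+1)} = \frac{z^n}{k!} \left\{ p_k^{(k)}(z) + z^{-1} p_{k+1}^{(k)}(z) + z^{-2} p_{k+2}^{(k)}(z) + \cdots + z^{-n} p_{k+n}^{(k)}(z) \right\}.$$

Here we express the polynomial derivatives by means of the formula of
example 22, chapter 2:

$$\frac{1}{k!}\, p_m^{(k)}(z) = \binom{m}{k} b_0 z^{m-k} + \binom{m-1}{k} b_1 z^{m-k-1} + \cdots + \binom{k}{k} b_{m-k}.$$

This yields the expression

$$x_n^{(k+1)} = z^n \left\{ \binom{k}{k} b_0 + z^{-1} \left[ \binom{k+1}{k} b_0 z + \binom{k}{k} b_1 \right] + \cdots + z^{-n} \left[ \binom{k+n}{k} b_0 z^n + \binom{k+n-1}{k} b_1 z^{n-1} + \cdots + \binom{k}{k} b_n \right] \right\}.$$

By rearranging the sum by collecting terms involving like powers of $z$ this
can be transformed into

$$x_n^{(k+1)} = \left[ \binom{k}{k} + \binom{k+1}{k} + \cdots + \binom{k+n}{k} \right] b_0 z^n + \left[ \binom{k}{k} + \binom{k+1}{k} + \cdots + \binom{k+n-1}{k} \right] b_1 z^{n-1} + \cdots + \binom{k}{k} b_n.$$

By using the identity of theorem 3.5, this simplifies into

$$x_n^{(k+1)} = \binom{k+n+1}{k+1} b_0 z^n + \binom{k+n}{k+1} b_1 z^{n-1} + \cdots + \binom{k+1}{k+1} b_n = \frac{1}{(k+1)!}\, p_{n+k+1}^{(k+1)}(z).$$
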
EXAMPLE
14. We compute the Taylor expansion of the polynomial

$$p(x) = 7x^4 + 5x^3 - 2x^2 + 8$$

at $x = 0.5$. Continuing the scheme started in example 12, we obtain

    7     5      -2       0       8
    7     8.5     2.25    1.125   8.5625
    7    12.0     8.25    5.25
    7    15.5    16.0
    7    19.0
    7

We thus have

$$7x^4 + 5x^3 - 2x^2 + 8 = 7(x - 0.5)^4 + 19(x - 0.5)^3 + 16(x - 0.5)^2 + 5.25(x - 0.5) + 8.5625.$$
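
The scheme of algorithm 3.6 can be sketched in a few lines of Python. This is only an illustration under assumptions made here, not part of the original text: the function name `taylor_shift` is hypothetical, and the array is updated in place, row by row, so that after pass $k$ the last entry still in use is $x_{N-k}^{(k)} = c_{N-k}$.

```python
def taylor_shift(b, z):
    """Return [c_0, ..., c_N] such that p(x) = sum of c_i * (x - z)**(N - i).

    b = [b_0, ..., b_N] holds the coefficients of p in descending powers.
    """
    x = [float(v) for v in b]      # working row, initially the coefficients b_n
    N = len(x) - 1
    c = [0.0] * (N + 1)
    for k in range(N + 1):
        # One Horner-type pass over the first N - k + 1 entries, cf. (3-16):
        # new x[n] = z * (new x[n-1]) + (old x[n] from the previous row).
        for n in range(1, N - k + 1):
            x[n] = z * x[n - 1] + x[n]
        # The last entry of row k lies on the lower diagonal of the scheme.
        c[N - k] = x[N - k]
    return c


# Data of example 14: p(x) = 7x^4 + 5x^3 - 2x^2 + 8 expanded about z = 0.5.
print(taylor_shift([7, 5, -2, 0, 8], 0.5))
```

For the data of example 14 the returned list reproduces the lower diagonal 7, 19, 16, 5.25, 8.5625 of the array above. The whole computation needs about $N^2/2$ multiplications and no explicit differentiation.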
Problems


8. Find the Taylor expansion of the polynomial 


P(x) = 4x? — 5x + 1 


at $x = -0.3$.
9. Calculate $p(x)$, $p'(x)$, $p''(x)$ for $x = 1.5$, where

$$p(x) = 2x^5 - 7x^4 + 3x^3 - 6x^2 + 4x - 5.$$
PART ONE


SOLUTION OF EQUATIONS


chapter 4 iteration


In this chapter we shall begin the study of the problem of solving (non- 
linear) equations in one and several variables. The method of iteration is 
the prototype of all numerical methods for attacking this problem. It is 
based on the solution of a certain nonlinear difference equation of the 
first order. The principle of iteration is of great importance also in 
certain branches of theoretical mathematics such as functional analysis 
and the theory of differential equations. 


4.1 Definition and Hypotheses 


We are here concerned with the problem of solving equations of the form

(4-1)  $x = f(x),$


where $f$ is a given function. The method of iteration consists in executing
the following algorithm:


Algorithm 4.1 Choose $x_0$ arbitrarily, and generate the sequence
$\{x_n\}$ recursively from the relation

(4-2)  $x_n = f(x_{n-1}), \qquad n = 1, 2, \ldots.$


At the outset, we cannot even be sure that this algorithm is well defined.
(It could be that $f$ is undefined at some point $f(x_n)$.) However, let us
assume that $f$ is defined on some closed finite interval $I = [a, b]$, and that
the values of $f$ lie in the same interval. Geometrically this means that the
graph of the function $y = f(x)$ is contained in the square $a \le x \le b$,
$a \le y \le b$ (see Fig. 4.1a).

Under this assumption, if $x_0 \in I$, we can say that all elements of the
sequence $\{x_n\}$ are in $I$. For if some $x_n \in I$ with $n \ge 0$, then also
$x_{n+1} = f(x_n) \in I$, since $f$ has its values in $I$.

Figure 4.1

A glance at figure 4.1a shows that the above hypotheses are not sufficient
to guarantee that equation (4-1) has a solution. The graph of the function
$y = f(x)$ need not intersect the graph of the function $y = x$. However, if
we assume the function $f$ to be continuous, then the graph shows that the
equation has at least one solution. For the graph of $y = f(x)$ originates
somewhere on the vertical straight line segment joining the points $(a, a)$
and $(a, b)$, and it ends somewhere on the straight line segment joining
$(b, a)$ and $(b, b)$. Since the graph is now continuous, it must intersect
the straight line $y = x$ ($a \le x \le b$), perhaps at an endpoint. If $s$ is the
abscissa of the point of intersection, then

$$y = s \quad \text{and} \quad y = f(s)$$

at that point, hence $s = f(s)$, i.e., the number $s$ is a solution of (4-1) (see
Fig. 4.1b).

The above intuitive consideration can be couched in purely analytical
terms, as follows. Consider the function $g$ defined by $g(x) = x - f(x)$,
$a \le x \le b$. This function is continuous on the interval $[a, b]$; moreover,
since $f(a) \ge a$, $f(b) \le b$, it satisfies $g(a) \le 0$, $g(b) \ge 0$. By the
intermediate value theorem of calculus (see Taylor [1959], p. 240) it assumes
all values between $g(a)$ and $g(b)$ somewhere in the interval $[a, b]$.
Therefore it must assume the value zero, say at $x = s$. This implies
$0 = s - f(s)$, or $s = f(s)$. Thus the number $s$ is the desired solution.

If an element of the sequence $\{x_n\}$ defined by algorithm 4.1 is equal to $s$,
then all later elements will also be equal to $s$. For this reason, any solution
$s$ of $x = f(x)$ is frequently called a fixed point of the iteration defined by
the function $f$.

The assumptions made so far do not preclude the possibility of the 
existence of several, or even infinitely many, fixed points in the interval 
[a, b] (see Fig. 4.1c). 

If we wish to be sure that there is not more than one solution, we must
make some assumption guaranteeing that the function $f$ does not vary too
rapidly. We could assume, for instance, that $f$ is differentiable, and that
its derivative $f'$ satisfies

(4-3)  $|f'(x)| \le L, \qquad a \le x \le b,$


where $L$ is some constant so that $L < 1$. It turns out, however, that the
following weaker assumption is sufficient to guarantee uniqueness: There
exists a constant $L < 1$ so that for any two points $x_1$ and $x_2$ in $I$ the following
inequality holds:

(4-4)  $|f(x_1) - f(x_2)| \le L|x_1 - x_2|.$


Any condition of the form (4-4) (whether $L < 1$ or not) is called a
Lipschitz condition, and the constant $L$ is called the Lipschitz constant. Let
us first show that condition (4-3) implies condition (4-4) with the same
value of $L$. By the mean value theorem of differential calculus (see
Taylor [1959], p. 240),

$$f(x_1) - f(x_2) = f'(x^*)(x_1 - x_2),$$

where $x^*$ is a suitable point between $x_1$ and $x_2$. Relation (4-4) now
follows readily by taking absolute values and using (4-3).

We now show that the Lipschitz condition (4-4) with $L < 1$ implies that
equation (4-1) has at most one solution. Assume that there are two
solutions $s_1$ and $s_2$, for instance. This means that the following relations
both are true:

$$s_1 = f(s_1), \qquad s_2 = f(s_2).$$

Subtracting the second relation from the first, we obtain

$$s_1 - s_2 = f(s_1) - f(s_2).$$

Taking absolute values and applying (4-4) to the difference on the right,
we get

$$|s_1 - s_2| = |f(s_1) - f(s_2)| \le L|s_1 - s_2|.$$

If $s_1 \ne s_2$, we can divide by $|s_1 - s_2|$ and obtain $1 \le L$, contradicting the
assumption that $L < 1$. Thus $s_1 = s_2$, i.e., any two solutions of (4-1) are
identical.
Problems


1. Show that the following functions satisfy Lipschitz conditions with L < 1: 


(a) S(x) = 5 — 4 cos 3x, 0S x S 27/3; 
(b) f(x) = 2 4+ 4 xl, -lsxsl; 
(c) hc.o hee ome 22.xX S 3; 


2. Let $m$ be any real number, and let $|\varepsilon| < 1$. Show that the equation

$$x = m - \varepsilon \sin x$$

has a unique solution in the interval $[m - \pi, m + \pi]$.

3. Prove that any function satisfying a Lipschitz condition on an interval $I$
is continuous at every point of $I$.

4. Construct an example showing that not every continuous function satisfies 
a Lipschitz condition. 


4.2 Convergence of the Iteration Method 


Having disposed of the theoretical preliminaries required to establish 
existence and uniqueness of the solution of (4-1), we now turn to the 
practical question of determining this solution by means of algorithm 4.1. 
Rather surprisingly it turns out that the assumptions which were made in 
§4.1 to guarantee existence and uniqueness suffice also to establish the 
fact that the sequence {x,} generated by algorithm 4.1 converges to the 
solution s. 


Theorem 4.2 Let $I = [a, b]$ be a closed finite interval, and let the
function $f$ satisfy the following conditions:


(i) $f$ is continuous on $I$;
(ii) $f(x) \in I$ for all $x \in I$;
(iii) $f$ satisfies the Lipschitz condition (4-4) with a Lipschitz constant
$L < 1$.


Then for any choice of $x_0 \in I$ the sequence defined by algorithm 4.1
converges to the unique solution $s$ of the equation $x = f(x)$.


Proof. That the hypotheses of theorem 4.2 guarantee the existence of a
unique solution $s$ of $x = f(x)$ has already been shown in §4.1. To prove
convergence of the sequence $\{x_n\}$, we shall estimate the difference $x_n - s$.
By definition,

$$x_n - s = f(x_{n-1}) - s = f(x_{n-1}) - f(s)$$

and hence, by the Lipschitz condition,

$$|x_n - s| \le L|x_{n-1} - s|.$$

Applying the same inequality repeatedly, we find

(4-5)  $|x_n - s| \le L^n |x_0 - s|.$

Since $0 \le L < 1$,

$$\lim_{n \to \infty} L^n = 0,$$

and it follows that

$$\lim_{n \to \infty} |x_n - s| = 0,$$

which means the same as

$$\lim_{n \to \infty} x_n = s,$$

completing the proof.

The convergence of the sequence $\{x_n\}$ is illustrated very suggestively by
plotting the points $(x_0, x_0)$, $(x_0, x_1)$, $(x_1, x_1)$, $(x_1, x_2)$, $(x_2, x_2)$, $(x_2, x_3)$, ...
in a graph of the function $y = f(x)$ (see Fig. 4.2). It is evident from the
figure that convergence cannot take place in general if condition (4-4)
does not hold with some $L < 1$.
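
A minimal Python sketch of algorithm 4.1 may help to make theorem 4.2 concrete. The test function $f(x) = \cos x$ on $I = [0, 1]$ is an assumption chosen here for illustration (it maps $I$ into itself, and by (4-3) it satisfies (4-4) with $L = \sin 1 < 1$); the helper name `iterate` is likewise hypothetical. The last iterate is used as a stand-in for the unknown solution $s$.

```python
import math


def iterate(f, x0, steps):
    """Return [x_0, x_1, ..., x_steps], where x_n = f(x_{n-1}) as in (4-2)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(f(xs[-1]))
    return xs


# Illustrative choice (not from the text): f = cos on I = [0, 1].
xs = iterate(math.cos, 0.0, 60)
L = math.sin(1.0)          # bound for |f'(x)| = |sin x| on [0, 1]
s = xs[-1]                 # approximation to the fixed point

print(s)
# Check the estimate (4-5), |x_n - s| <= L^n |x_0 - s|, along the sequence.
print(all(abs(x - s) <= L**n * abs(xs[0] - s) + 1e-15
          for n, x in enumerate(xs)))
```

The small constant added in the comparison only guards against rounding at $n = 0$, where both sides of (4-5) coincide.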

EXAMPLE
1. It is desired to find a solution of the equation

$$x = e^{-x}.$$

For $0 \le x \le 1$, the values of the function $f(x) = e^{-x}$ lie in the interval
$[e^{-1}, 1]$, and thus a fortiori in $[0, 1]$. For $x_1$, $x_2$ in this interval, we have
by the mean value theorem

$$|f(x_1) - f(x_2)| = |f'(x^*)|\,|x_1 - x_2|,$$

where $f'(x^*) = -e^{-x^*}$. Since the maximum of $|f'(x)|$ for $x \in [0, 1]$ is 1,
(4-4) holds only with $L = 1$, violating the condition that $L < 1$. However,
let us consider the smaller interval $I = [\tfrac{1}{2}, \log 2]$. Since

$$\tfrac{1}{2} = e^{-\log 2} \le e^{-x} \le e^{-1/2} < \log 2$$

for $\tfrac{1}{2} \le x \le \log 2$, $I$ is again mapped into $I$ by the function $f$, and (4-4)
now holds with $L = |f'(\tfrac{1}{2})| = e^{-1/2} = 0.606531$. Beginning with $x_0 = 0.5$,
the first few values of the sequence $\{x_n\}$ are as follows:


Table 4.2

     n      x_n        f(x_n)      x'_n
     0   0.500000    0.606531    0.567624
     1   0.606531    0.545239    0.567299
     2   0.545239    0.579703    0.567193
     3   0.579703    0.560065    0.567159
     4   0.560065    0.571172    0.567148
     5   0.571172    0.564863    0.567145
     6   0.564863    0.568438    0.567144
     7   0.568438    0.566410    0.567144
     8   0.566410    0.567560
     9   0.567560    0.566907
    10   0.566907    0.567278
    11   0.567278    0.567067
    12   0.567067    0.567187
    13   0.567187    0.567119
    14   0.567119


Problems


5. Kepler's equation

$$m = x - E \sin x,$$

where $m$ and $E$ are given and $x$ is sought, plays a considerable role in
dynamical astronomy. Solve the equation iteratively for $m = 0.8$,
$E = 0.2$, by writing it in the form

$$x = m + E \sin x$$

and starting with $x_0 = m$.
6. The solution of Kepler's equation can also be represented analytically in
the form

$$x = m + 2 \sum_{n=1}^{\infty} \frac{1}{n}\, J_n(nE) \sin nm,$$

where $J_n$ denotes the Bessel function of order $n$. Using tables of Bessel
functions, check the result obtained in problem 5 and compare the amount
of labor required.


7. Find the only positive solution of the equation

$$x^3 - x^2 - x - 1 = 0$$

by iteration, writing it in the form

$$x = 1 + \frac{1}{x} + \frac{1}{x^2}.$$

(Begin with $x_0 = 1$.)


8. Assume that the function $f$, in addition to the hypotheses of theorem 4.2,
is differentiable and satisfies $f'(x) < 0$, $x \in I$. If $x_0 < s$, prove both
analytically and by considering the graph of $f$ that

$$x_0 < x_2 < x_4 < \cdots < s < \cdots < x_5 < x_3 < x_1.$$


9. Suppose the function $g$ is defined and differentiable on the interval $[0, 1]$,
and suppose that $g(0) < 0 < g(1)$, $0 < a \le g'(x) \le b$, where $a$ and $b$ are
constants. Show that there exists a constant $M$ such that the solution of
the equation $g(x) = 0$ can be found by applying iteration to the function

$$f(x) = x + M g(x).$$

10. What is the value of

$$s = \sqrt{2 + \sqrt{2 + \sqrt{2 + \cdots}}}\;?$$

(Hint: The number $s$ may be considered as the limit of the sequence
$\{x_n\}$ generated by algorithm 4.1, where $f(x) = \sqrt{2 + x}$, $x_0 = 0$. Show
that $f$ satisfies, in a suitable interval, the hypotheses of theorem 4.2.)


4.3 The Error after a Finite Number of Steps


No computing process can be carried on indefinitely, and in any
practical application algorithm 4.1 must be artificially terminated after
having computed, say, the element $x_n$. We are interested in finding a
bound for the quantity $|x_n - s|$, that is, for the error of $x_n$ considered as
an approximation to the solution $s$. This bound should depend only on
quantities that are known a priori and should not depend on a knowledge
of the solution itself. (This is why a result such as (4-5) does not satisfy
our purpose.)

To establish such a bound, we require the following auxiliary result:
For $n = 0, 1, 2, \ldots$,

(4-6)  $|x_{n+1} - x_n| \le L^n |x_1 - x_0|.$


Evidently this is true for $n = 0$. Assuming the truth of (4-6) for
$n = k - 1$, where $k$ is an integer $> 0$, we have

$$|x_{k+1} - x_k| = |f(x_k) - f(x_{k-1})| \le L|x_k - x_{k-1}| \le L \cdot L^{k-1} |x_1 - x_0| = L^k |x_1 - x_0|,$$

establishing (4-6) for $n = k$. The truth of (4-6) for all positive integers $n$
now follows by the principle of mathematical induction.
Let $n$ now be a fixed positive integer, and let $m > n$. We shall find a
bound for $x_m - x_n$. Writing

$$x_m - x_n = (x_m - x_{m-1}) + (x_{m-1} - x_{m-2}) + \cdots + (x_{n+1} - x_n)$$

and applying the triangle inequality (see Taylor [1959], p. 443) we get

$$|x_m - x_n| \le |x_m - x_{m-1}| + |x_{m-1} - x_{m-2}| + \cdots + |x_{n+1} - x_n|$$

and using (4-6) to estimate each term on the right,

$$|x_m - x_n| \le (L^{m-1} + L^{m-2} + \cdots + L^n)\,|x_1 - x_0|.$$

Now, since $|L| < 1$,

$$L^{m-1} + L^{m-2} + \cdots + L^n = L^n (1 + L + \cdots + L^{m-n-1}) \le L^n (1 + L + L^2 + \cdots) = \frac{L^n}{1 - L}$$

by virtue of the familiar formula for the sum of the geometric series. We
thus have

$$|x_m - x_n| \le \frac{L^n}{1 - L}\, |x_1 - x_0|.$$

In this relation let $m \to \infty$ while keeping $n$ fixed. By virtue of $x_m \to s$
we obtain


Corollary 4.3 Under the conditions of theorem 4.2 the error of the
$n$th approximation $x_n$ defined by algorithm 4.1 is bounded as follows:

(4-7)  $|x_n - s| \le \frac{L^n}{1 - L}\, |x_1 - x_0|.$

EXAMPLE
2. The error of the final element $x_{14}$ produced in example 1 is, by (4-7)
with $L = e^{-1/2}$, less than

$$\frac{L^{14}}{1 - L}\, |x_1 - x_0| = \frac{e^{-7}}{1 - e^{-1/2}} \cdot 0.106531 \approx 0.00025.$$

Actually, since $f' < 0$, it follows from problem 8 that the solution satisfies

$$0.567119 < s < 0.567187.$$
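
The a priori bound (4-7) is easy to evaluate numerically. The following Python sketch does so for the data of example 1 and compares the bound with the actual error; it is an illustration only, and the names `a_priori_bound` and `steps_needed` as well as the way the reference value for $s$ is obtained (by simply iterating much longer) are assumptions made here.

```python
import math

f = lambda x: math.exp(-x)
L = math.exp(-0.5)            # Lipschitz constant on [1/2, log 2], see example 1
x0 = 0.5
x1 = f(x0)


def a_priori_bound(n):
    """Right-hand side of (4-7): L^n / (1 - L) * |x_1 - x_0|."""
    return L**n / (1.0 - L) * abs(x1 - x0)


def steps_needed(tol):
    """Smallest n for which the bound (4-7) does not exceed tol."""
    n = 0
    while a_priori_bound(n) > tol:
        n += 1
    return n


# Reference value for s: iterate far beyond the accuracy needed here.
s = x0
for _ in range(200):
    s = f(s)

# The element x_14 of example 1.
x = x0
for _ in range(14):
    x = f(x)

print(a_priori_bound(14), abs(x - s), steps_needed(1e-6))
```

The bound is necessarily pessimistic, since it uses the largest possible contraction factor $L$ at every step; the actual error of $x_{14}$ is smaller by roughly a factor of ten.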


Problem 


11. For Kepler's equation considered in problem 5, estimate the error of $x_{10}$.


4.4 Accelerating Convergence 


Let us now assume that the function $f$, in addition to satisfying the
hypotheses of theorem 4.2, is continuously differentiable throughout the
interval $I$, and that the derivative $f'$ is never zero. This means that $f$ is
either monotonically strictly increasing or monotonically strictly decreasing
throughout the whole interval $I$. If $x_0 \ne s$, it is evident from a graph
that under this condition no $x_n$ can be equal to the exact solution $s$, i.e.,
the iteration process cannot terminate in a finite number of steps. An
analytical proof of this fact is as follows: Assume the contrary, namely, that
$f(x_n) = x_n$ for some $n$. If $n$ is the first index for which this happens, then

$$x_n = f(x_{n-1}) = f(x_n), \qquad x_{n-1} \ne x_n,$$

and hence, by the mean value theorem (see Taylor [1959], p. 240),

$$0 = f(x_{n-1}) - f(x_n) = f'(x^*)(x_{n-1} - x_n),$$

where $x^*$ is between $x_{n-1}$ and $x_n$. Since $x_{n-1} - x_n \ne 0$, it follows that
$f'(x^*) = 0$, contradicting the hypothesis that $f'$ never vanishes.
The above implies that the error

$$d_n = x_n - s$$

is never zero. We now ask: Does the limit

$$\lim_{n \to \infty} \frac{d_{n+1}}{d_n}$$

exist, and if it does, what is its value?
Using the mean value theorem once more, we have

$$d_{n+1} = x_{n+1} - s = f(x_n) - s = f(s + d_n) - f(s) = f'(s + \theta_n d_n)\, d_n,$$

where $0 < \theta_n < 1$. Let us define $\varepsilon_n$ by

$$f'(s + \theta_n d_n) = f'(s) + \varepsilon_n.$$

We then have

(4-8)  $d_{n+1} = (f'(s) + \varepsilon_n)\, d_n$

and, since $\varepsilon_n \to 0$ for $n \to \infty$ by virtue of the continuity of $f'$,

(4-9)  $\lim_{n \to \infty} \frac{d_{n+1}}{d_n} = f'(s).$


This equation shows that the error at the $(n + 1)$st step is approximately
equal to $f'(s)$ times the error at the $n$th step. As $s$ is unknown, the
limiting ratio $f'(s)$ of two consecutive errors is, of course, unknown, and
all we really know is that the ratio of two consecutive errors approaches
some unknown limit. We now shall show, however, how to obtain a
significant improvement of the convergence of the iteration method by a
judicious use of this incomplete information.
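
Relation (4-9) can be observed numerically. The Python sketch below uses the function $f(x) = e^{-x}$ of example 1, for which $f'(s) = -e^{-s} = -s$; it is an illustration under assumptions made here, in particular that a reference value for $s$ may be obtained by iterating far beyond the accuracy needed.

```python
import math

f = lambda x: math.exp(-x)

# Reference value for s, obtained here simply by iterating until the
# iterates no longer change appreciably in double precision.
s = 0.5
for _ in range(200):
    s = f(s)

x = 0.5
ratios = []
for _ in range(12):
    x_next = f(x)
    ratios.append((x_next - s) / (x - s))   # d_{n+1} / d_n
    x = x_next

# By (4-9) the ratios should approach f'(s) = -exp(-s).
print(ratios[-1], -math.exp(-s))
```

All ratios are negative, which reflects the fact that for $f' < 0$ the iterates lie alternately to the left and to the right of $s$.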

Let us heuristically proceed under the assumption that (4-8) is exact
with $\varepsilon_n = 0$ for finite values of $n$. We then have, writing $f'(s) = A$ for
brevity,

$$x_{n+1} - s = A(x_n - s),$$
$$x_{n+2} - s = A(x_{n+1} - s).$$

It is an easy matter now to eliminate the unknown quantity $A$ and solve
for $s$. Subtracting the first equation from the second, we obtain

$$x_{n+2} - x_{n+1} = A(x_{n+1} - x_n),$$

hence

$$A = \frac{x_{n+2} - x_{n+1}}{x_{n+1} - x_n}.$$

Solving the first equation for $s$ and substituting for $A$, we get

$$s = \frac{1}{1 - A}\,(x_{n+1} - A x_n) = x_n + \frac{1}{1 - A}\,(x_{n+1} - x_n) = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.$$

If our assumption that $\varepsilon_n = 0$ were correct, we thus could obtain the exact
solution from any three consecutive iterates $x_n$, $x_{n+1}$, $x_{n+2}$. In reality, of
course, the $\varepsilon_n$ are not exactly zero. They are, however, ultimately small
compared to $f'(s)$. We thus may hope that for $n$ large the quantities

(4-10)  $x_n' = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}$

yield a better approximation to $s$ than the quantities $x_n$. We are thus led
to investigate the properties of the following algorithm:


Algorithm 4.4 (Aitken's $\Delta^2$-method) Given a sequence of numbers
$\{x_n\}$, generate from it a new sequence $\{x_n'\}$ by means of formula (4-10).


Formula (4-10) can be simplified somewhat by means of the difference
operator $\Delta$. If $\{x_n\}$ is any sequence, we write

$$\Delta x_n = x_{n+1} - x_n, \qquad n = 0, 1, 2, \ldots.$$

Higher powers of the operator $\Delta$ are defined recursively. For instance,

$$\Delta^2 x_n = \Delta(\Delta x_n) = \Delta x_{n+1} - \Delta x_n = x_{n+2} - 2x_{n+1} + x_n.$$

It is now evident that formula (4-10) can be written thus:

(4-11)  $x_n' = x_n - \frac{(\Delta x_n)^2}{\Delta^2 x_n}.$


This notation, together with the fact that algorithm 4.4 was discussed by
Aitken [1926], explains its traditional name.


EXAMPLE 


3. The last column of table 4.2 (see example 1) contains the accelerated
Aitken values of the sequence $\{x_n\}$. It is seen that after six steps the
sequence $\{x_n'\}$ has converged to the number of digits given, whereas the
original sequence $\{x_n\}$ has not even nearly converged after fourteen steps.
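
Algorithm 4.4 is short enough to be written out directly. The following Python sketch applies formula (4-11) to the iterates of example 1; the function name `aitken` is hypothetical, and no safeguard against a vanishing second difference $\Delta^2 x_n$ is included, which a production routine would need.

```python
import math


def aitken(xs):
    """Apply (4-11) to a list of iterates; the result is two entries shorter."""
    out = []
    for n in range(len(xs) - 2):
        d1 = xs[n + 1] - xs[n]                      # first difference of x_n
        d2 = xs[n + 2] - 2.0 * xs[n + 1] + xs[n]    # second difference of x_n
        out.append(xs[n] - d1 * d1 / d2)
    return out


# Iterates x_0, ..., x_9 of example 1 (f(x) = exp(-x), x_0 = 0.5).
xs = [0.5]
for _ in range(9):
    xs.append(math.exp(-xs[-1]))

acc = aitken(xs)
for n, value in enumerate(acc):
    print(f"{n:2d}  {xs[n]:.6f}  {value:.6f}")
```

Since $x_n'$ requires $x_{n+1}$ and $x_{n+2}$, ten iterates yield the eight accelerated values listed in the last column of table 4.2.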


Problem 


12. The convergence behavior typified by (4-8) is sometimes characterized by
the statement that "the number of correct decimal digits in $x_n$ grows at a
fixed rate." Give a quantitative interpretation of this statement, allowing
you to answer the following question: As $n \to \infty$, how many steps are
necessary (on the average) to reduce the error by a factor $\tfrac{1}{10}$?


4.5 Aitken’s A?-Method 


After having discovered algorithm 4.4 in a heuristic manner, we now
shall present a rigorous analysis of it. This analysis applies to arbitrary
sequences having certain convergence properties and is not confined to
sequences arising through iteration. The basic result is as follows.


Theorem 4.5 Let $\{x_n\}$ be any sequence converging to the limit $s$
such that the quantities $d_n = x_n - s$ satisfy $d_n \ne 0$,

(4-12)  $d_{n+1} = (A + \varepsilon_n)\, d_n,$

where $A$ is a constant, $|A| < 1$, and $\varepsilon_n \to 0$ for $n \to \infty$. Then the
sequence $\{x_n'\}$ derived from $\{x_n\}$ by means of algorithm 4.4 is defined
for $n$ sufficiently large and converges to $s$ faster than the sequence
$\{x_n\}$ in the sense that

(4-13)  $\frac{x_n' - s}{x_n - s} \to 0, \qquad n \to \infty.$


Proof. Applying (4-12) twice, we have

$$d_{n+2} = (A + \varepsilon_{n+1})(A + \varepsilon_n)\, d_n.$$

Hence

$$\Delta^2 x_n = x_{n+2} - 2x_{n+1} + x_n = d_{n+2} - 2d_{n+1} + d_n = [(A - 1)^2 + \varepsilon_n']\, d_n,$$

where

$$\varepsilon_n' = A(\varepsilon_n + \varepsilon_{n+1}) - 2\varepsilon_n + \varepsilon_n \varepsilon_{n+1}.$$

By virtue of $\varepsilon_n \to 0$ it follows that also

(4-14)  $\varepsilon_n' \to 0, \qquad n \to \infty.$


We conclude that $(A - 1)^2 + \varepsilon_n' \ne 0$ for all sufficiently large $n$, $n > n_0$, say.
It follows that $\Delta^2 x_n \ne 0$ for $n > n_0$; hence the sequence $\{x_n'\}$ is defined for
$n > n_0$. We have

$$\Delta x_n = \Delta d_n = (A + \varepsilon_n - 1)\, d_n$$

and hence, subtracting $s$ from (4-11),

$$x_n' - s = d_n - \frac{(\Delta x_n)^2}{\Delta^2 x_n} = d_n - \frac{(A - 1 + \varepsilon_n)^2}{(A - 1)^2 + \varepsilon_n'}\, d_n = \frac{\varepsilon_n' - 2\varepsilon_n (A - 1) - \varepsilon_n^2}{(A - 1)^2 + \varepsilon_n'}\, d_n.$$

The hypothesis that $\varepsilon_n \to 0$ and (4-14) now insure that

$$\frac{x_n' - s}{x_n - s} = \frac{\varepsilon_n' - 2\varepsilon_n (A - 1) - \varepsilon_n^2}{(A - 1)^2 + \varepsilon_n'} \to 0,$$

as desired.
The result of theorem 4.5 is immediately applicable to sequences
generated by iteration, as shown by corollary 4.5.


Corollary 4.5 Let the function $f$, in addition to the hypotheses of
theorem 4.2, have a continuous derivative on $I$, and assume that
$f' \ne 0$. Provided that $x_0 \ne s$, the sequence $\{x_n\}$ generated by
algorithm 4.1 satisfies the hypotheses of theorem 4.5, and algorithm
4.4 thus results in a speedup of convergence.


Proof. As was shown at the beginning of §4.4, the hypotheses of the
corollary suffice not only to show that $d_n \ne 0$, $n = 0, 1, 2, \ldots$, but also
that (4-8) holds, which is precisely what is needed in theorem 4.5.

Although originally motivated by the iterative procedure, the effectiveness
of the $\Delta^2$-acceleration is in no way confined to sequences generated
by iteration. Some other instances where it may be applied are given in
problems 15, 16, and in chapter 7.


Problems 


13. Apply algorithm 4.4 to the sequences obtained in problems 5 and 7.
14. Let $x_n$ be the $n$th partial sum of the infinite series $\sum_{k=0}^{\infty} a_k$,

$$x_n = \sum_{k=0}^{n} a_k.$$

Show that algorithm 4.4 in this case yields the formula

$$x_n' = x_n + \frac{a_{n+1}^2}{a_{n+1} - a_{n+2}}.$$

15. Show that the hypotheses of theorem 4.5 are satisfied if the terms $a_n$ of
the series considered in problem 14 satisfy

$$a_n = (a + \varepsilon_n) z^n,$$

where $a$ is a constant, $|z| < 1$, and where the sequence $\{\varepsilon_n\}$ tends to zero.
(Hint: Introduce the quantities

$$\delta_n = \sup_{k \ge n} |\varepsilon_k|.)$$
16. Evaluate the sum 
wo 
S 1 
iH cosh(n log 2) 


to 5 decimal places. (Apply problem 14.) 
17*. Assume the quantities $\varepsilon_n$ in theorem 4.5 are such that the limit

(4-15)  $\lim_{n \to \infty} \frac{\varepsilon_{n+1}}{\varepsilon_n} = B$

exists. Show that the convergence of the sequence $\{x_n'\}$ can be sped up
even further by applying algorithm 4.4 once more.


18*. Show that condition (4-15) is satisfied if the sequence $\{x_n\}$ is generated
by algorithm 4.1, provided that the second derivative $f''$ of $f$ exists, is
continuous, and satisfies $f''(s) \ne 0$.


4.6 Quadratic Convergence 


In §4.4 we had assumed that $f'(x) \ne 0$ on the interval $I$, and thus in
particular that $f'(s) \ne 0$. We then obtained a convergence behavior
characterized by relation (4-8). This is known as linear convergence.
(The number of correct decimal places is approximately a linear function
of the number of iterations performed.) Let us now investigate the
asymptotic behavior of the error if $f'(s) = 0$. We first note that if this is
known to be the case, then it is not necessary to verify all the hypotheses
of theorem 4.2.


Theorem 4.6 Let $I$ be an interval (finite or infinite), and let the
function $f$ be defined on $I$ and satisfy the following conditions:


(i) $f$ and $f'$ are continuous on $I$;
(ii) the equation $x = f(x)$ has a solution $s$, located in the interior of $I$,
such that $f'(s) = 0$.


Then there exists a number $d > 0$ such that algorithm 4.1 converges
to $s$ for any choice of $x_0$ satisfying $|x_0 - s| \le d$.


The conclusion of the theorem can be expressed by saying that the
algorithm always converges when the starting point is "sufficiently close"
to the solution.


Proof. Let $I_d$ denote the interval $[s - d, s + d]$. Since $s$ is in the
interior of $I$, $I_d$ is contained in $I$ if $d$ is sufficiently small, $d \le d_0$ say. Let
$L$ be given, $0 < L < 1$. By the continuity of $f'$, there exists $d$ satisfying
$0 < d \le d_0$ such that $|f'(x) - f'(s)| = |f'(x)| \le L$ for $x \in I_d$. An
application of the mean value theorem now shows that for $x \in I_d$

$$|f(x) - s| = |f(x) - f(s)| \le L|x - s| \le Ld < d;$$

thus, the values taken by $f$ in $I_d$ lie in $I_d$. The hypotheses of theorem 4.2
are thus satisfied for the interval $I_d$, and as a consequence algorithm 4.1
converges.

Let us now assume, in addition to the hypotheses of theorem 4.6, that
$f''$ exists, is continuous, and does not vanish on $I_d$. As at the beginning
of §4.4, we then can show that if $x_0 \ne s$, no $x_n$ will be accidentally equal
to $s$, and that the iteration algorithm cannot yield the exact solution in a
finite number of steps.
Using Taylor's theorem with remainder (see Taylor [1959], p. 476) we
find the following expression for the error $d_{n+1} = x_{n+1} - s$:

$$d_{n+1} = x_{n+1} - s = f(x_n) - f(s) = f'(s)\, d_n + \tfrac{1}{2} f''(s + \theta_n d_n)\, d_n^2.$$

Here $\theta_n$ denotes, as usual, an unspecified number between zero and one.
By virtue of our assumption that $f'(s) = 0$ the above expression simplifies,
and we get

(4-16)  $d_{n+1} = \tfrac{1}{2} f''(s + \theta_n d_n)\, d_n^2.$

Since $d_n \ne 0$ and $d_n \to 0$ for $n \to \infty$ it follows that

(4-17)  $\lim_{n \to \infty} \frac{d_{n+1}}{d_n^2} = \tfrac{1}{2} f''(s).$


Relation (4-16) states the remarkable fact that if $f'(s) = 0$, the error at
the $(n + 1)$st step is proportional to the square of the error at the $n$th
step. This type of convergence behavior is known as quadratic convergence.
It is frequently, if somewhat vaguely, described by stating that
the number of correct decimal places is doubled at each step.


EXAMPLE
4. Let $a > 0$, and let $f(x) = \tfrac{1}{2}(x + a/x)$ for $x > 0$. The equation
$x = f(x)$ has the solution $x = \sqrt{a}$. It is easily seen that $f'(\sqrt{a}) = 0$, and
that $f''(x) > 0$, $x > 0$. It thus follows that the sequence defined by

$$x_{n+1} = \frac{1}{2} \left( x_n + \frac{a}{x_n} \right)$$

converges to $\sqrt{a}$ quadratically, provided that $x_0$ is sufficiently close to $\sqrt{a}$.
(Actually, it can be shown that the sequence converges for every choice of
$x_0 > 0$, see §4.9.) This algorithm for calculating $\sqrt{a}$ is a special case of
Newton's method, which is discussed in the next section.
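
The quadratic convergence asserted in example 4 can be checked numerically. In the Python sketch below the choices $a = 2$ and $x_0 = 1$ are assumptions made here for illustration, as is the helper name `sqrt_iteration`. Since $f''(x) = a/x^3$, relation (4-17) predicts that the ratios $d_{n+1}/d_n^2$ approach $\tfrac{1}{2} f''(\sqrt{a}) = 1/(2\sqrt{a})$.

```python
import math


def sqrt_iteration(a, x0, steps):
    """Iterates of x_{n+1} = (x_n + a / x_n) / 2, starting from x0."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + a / x))
    return xs


a = 2.0
xs = sqrt_iteration(a, 1.0, 5)
s = math.sqrt(a)
errs = [x - s for x in xs]                     # errors d_n = x_n - s

# Ratios d_{n+1} / d_n^2 for the first few steps, cf. (4-17).
ratios = [errs[n + 1] / errs[n] ** 2 for n in range(4)]
print(ratios, 1.0 / (2.0 * s))
```

Only the first few ratios are formed, because after four or five steps the error has reached the level of rounding and the quotient is no longer meaningful in double precision.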


Problems 


19. For what values of the constant $M$ in problem 9 does the sequence
defined by $x_n = f(x_{n-1})$ converge quadratically to the solution $s$?

20. Let the function $f$ have a nonvanishing third derivative, and assume that
$s = f(s)$, $f'(s) = f''(s) = 0$. Show that in this case the limit of $d_{n+1}/d_n^3$
exists for $x_0$ sufficiently close to $s$. (The convergence is called cubic in
this case.)

21*. Let

$$f(x) = \frac{x^3 + bx}{cx^2 + d}.$$

Determine the constants $b$, $c$, $d$ in such a manner that the sequence
defined by algorithm 4.1 converges cubically to $\sqrt{z}$. Use the algorithm
thus obtained to calculate $\sqrt{10}$ to ten decimal places, starting with $x_0 = 3$.


4.7 Newton’s Method 


The reader might well be under the impression that the discussion of 
quadratic convergence in the last section is without much practical value, 
since for an equation of the form x = f(x) the condition f’(s) = 0 will be 
satisfied only by accident. However, it will now be shown that, at least 
for differentiable functions f, the basic iteration procedure of algorithm 
4.1 can always be reformulated in such a way that it becomes quadratically 
convergent. 

Let the function $F$ be defined and twice continuously differentiable on
the interval $I = [a, b]$, let $F'(x) \ne 0$ for $x \in I$, and let the equation

(4-18)  $F(x) = 0$

have the solution (necessarily the only solution) $x = s$, where $s$ lies in the
interior of $I$. We have already observed that this solution can be found
by applying iteration to the function

$$f(x) = x + M F(x),$$

where $M$ is a constant that must satisfy certain inequalities (see problem 9).
Unless we are lucky, the convergence of the iteration sequence thus
generated is linear. We now ask: Can we determine a function $h$ (depending
on $F$, but easily calculable) so that the iteration sequence generated
by the function

$$f(x) = x + h(x) F(x)$$

converges to $s$ quadratically?
In view of theorem 4.6 the sole condition to be satisfied by $f$ (in addition
to the obviously satisfied condition $f(s) = s$) is $f'(s) = 0$. In view of

$$f'(x) = 1 + h'(x) F(x) + h(x) F'(x)$$

this yields

$$h'(s) F(s) + h(s) F'(s) = -1$$

or, since $F(s) = 0$,

$$h(s) = -\frac{1}{F'(s)}.$$

A simple way to satisfy this condition (we do not claim it is the only way)
is to choose

$$h(x) = -\frac{1}{F'(x)}.$$

We are thus led to the following algorithm:


Algorithm 4.7 Choose $x_0$, and determine the sequence $\{x_n\}$ from the
recurrence relation

(4-19)  $x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)}, \qquad n = 0, 1, 2, \ldots.$


Algorithm 4.7 is known as Newton's method or as the Newton-Raphson
method. The convergence of Newton's method for starting values $x_0$
sufficiently close to $s$ follows immediately from theorem 4.6, since the
iteration function $f(x) = x - F(x)/F'(x)$ was specifically constructed so
as to satisfy $f'(s) = 0$. If we assume, in addition to the hypotheses made
above, that $F'''$ exists and is continuous, then it easily follows that $f''$ is
continuous, and hence that the convergence of the sequence defined by
(4-19), if it takes place at all, is quadratic.

Formula (4-19) has a very simple graphical interpretation. We approximate
the graph of the function $F$ by its tangent at the point $x_n$, that is,
$F(x)$ is replaced by

$$F(x_n) + (x - x_n) F'(x_n).$$


Setting this expression equal to zero and solving for x, we find equation 
(4-19) (see Fig. 4.7). Intuitively appealing as they may be, considerations 
such as these tell us nothing about the nature of convergence, nor are they 
easily extended to the case of systems of equations. 
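
A minimal Python sketch of algorithm 4.7 follows. It is not a definitive implementation: the function name `newton`, the stopping rule based on the size of the last correction, and the limit on the number of steps are all assumptions made here, since the text leaves the question of termination open. As a test equation we take $F(x) = x - e^{-x} = 0$, which has the same solution as the equation of example 1.

```python
import math


def newton(F, dF, x0, tol=1e-12, max_steps=50):
    """Iterate (4-19) until the correction F(x)/F'(x) is at most tol in size."""
    x = x0
    for _ in range(max_steps):
        step = F(x) / dF(x)       # assumes F'(x) != 0 along the way
        x -= step
        if abs(step) <= tol:
            return x
    raise RuntimeError("no convergence within max_steps")


# Test equation (an illustrative choice): F(x) = x - exp(-x).
F = lambda x: x - math.exp(-x)
dF = lambda x: 1.0 + math.exp(-x)

root = newton(F, dF, 0.5)
print(root, F(root))
```

Starting from $x_0 = 0.5$, a handful of steps suffices here, in contrast to the fourteen steps of table 4.2, which still leave an error in the fifth decimal place.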





Figure 4.7 
4.8 A Non-local Convergence Theorem for Newton's Method


The results proved in §4.7 are still unsatisfactory inasmuch as convergence
is proved only for starting values "sufficiently close" to the solution $s$.
Since the exact solution is unknown, the question how to find the first
approximation $x_0$ remains unanswered. We now shall prove a result
which explicitly specifies an interval in which the first approximation may
be taken. The one new hypothesis which must be made is that $F''$ does
not change sign in the interval under consideration.


Theorem 4.8 Let the function $F$ be defined and twice continuously
differentiable on the closed finite interval $[a, b]$, and let the following
conditions be satisfied:


(i) $F(a)F(b) < 0$;
(ii) $F'(x) \ne 0$, $x \in [a, b]$;
(iii) $F''(x)$ is either $\ge 0$ or $\le 0$ for all $x \in [a, b]$;
(iv) If $c$ denotes that endpoint of $[a, b]$ at which $|F'(x)|$ is smaller,
then

$$\left| \frac{F(c)}{F'(c)} \right| \le b - a.$$

Then Newton's method converges to the (only) solution $s$ of $F(x) = 0$
for any choice of $x_0$ in $[a, b]$.


Some explanation of the hypotheses is in order. Condition (i) merely
states that $F(a)$ and $F(b)$ have different signs, and hence that the equation
$F(x) = 0$ has at least one solution in $(a, b)$. By virtue of condition (ii)
there is only one solution. Condition (iii) states that the graph of $F$ is either
concave from above or concave from below. Condition (iv) states that
the tangent to the curve $y = F(x)$ at that endpoint where $|F'(x)|$ is smallest
intersects the $x$-axis within the interval $[a, b]$.
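
The hypotheses can be tried out on a simple case. The Python sketch below uses $F(x) = x^2 - 2$ on $[a, b] = [1, 2]$, an example chosen here and not taken from the text; the helper names are hypothetical. Only the two endpoint conditions (i) and (iv) are evaluated by the code. Conditions (ii) and (iii) concern every point of the interval and must be verified analytically (here $F'(x) = 2x \ne 0$ and $F''(x) = 2 \ge 0$ on $[1, 2]$).

```python
import math


def endpoint_conditions(F, dF, a, b):
    """Evaluate conditions (i) and (iv) of theorem 4.8 on [a, b]."""
    cond_i = F(a) * F(b) < 0
    c = a if abs(dF(a)) <= abs(dF(b)) else b    # endpoint with smaller |F'|
    cond_iv = abs(F(c) / dF(c)) <= b - a
    return cond_i, cond_iv


def newton_steps(F, dF, x0, steps):
    """Perform a fixed number of steps of (4-19)."""
    x = x0
    for _ in range(steps):
        x = x - F(x) / dF(x)
    return x


F = lambda x: x * x - 2.0
dF = lambda x: 2.0 * x
a, b = 1.0, 2.0

print(endpoint_conditions(F, dF, a, b))

# Start Newton's method from eleven equally spaced points of [a, b].
starts = [a + (b - a) * i / 10 for i in range(11)]
worst = max(abs(newton_steps(F, dF, x0, 8) - math.sqrt(2.0)) for x0 in starts)
print(worst)
```

In agreement with the theorem, every starting value in $[1, 2]$ leads to $\sqrt{2}$; the experiment illustrates the statement but of course proves nothing.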


Proof. Theorem 4.8 covers the following four different situations:

(a) $F(a) < 0$, $F(b) > 0$, $F''(x) \leq 0$ (c = b);
(b) $F(a) > 0$, $F(b) < 0$, $F''(x) \geq 0$ (c = b);
(c) $F(a) < 0$, $F(b) > 0$, $F''(x) \geq 0$ (c = a);
(d) $F(a) > 0$, $F(b) < 0$, $F''(x) \leq 0$ (c = a).

The cases (b) and (d) are readily reduced to the cases (a) and (c), respectively,
by considering the function $-F$ in place of F. (This change does
not change the sequence $\{x_n\}$.) Case (c) is reduced to case (a) by replacing
x by $-x$. (This changes the sequence $\{x_n\}$ into $\{-x_n\}$, and the solution s
to $-s$.) It thus suffices to prove the theorem in case (a). Here the graph
of F looks as given in figure 4.8.

Figure 4.8


Let s be the unique solution of F(x) = 0. We first assume that
$a \leq x_0 \leq s$. By virtue of $F(x_0) \leq 0$, it is clear that

$$x_1 = x_0 - \frac{F(x_0)}{F'(x_0)} \geq x_0.$$

We assert that $x_n \leq s$, $x_{n+1} \geq x_n$ for all values of n. This being true for
n = 0, it suffices to perform the induction step from n to n + 1. If
$x_n \leq s$, then by the mean value theorem

$$-F(x_n) = F(s) - F(x_n) = (s - x_n)F'(x_n^*),$$

where $x_n \leq x_n^* \leq s$. By virtue of $F''(x) \leq 0$, $F'$ is decreasing, hence
$F'(x_n^*) \leq F'(x_n)$,

$$-F(x_n) \leq (s - x_n)F'(x_n),$$

and

$$x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)} \leq x_n + (s - x_n) = s.$$

Consequently, $F(x_{n+1}) \leq 0$ and $x_{n+2} = x_{n+1} - F(x_{n+1})/F'(x_{n+1}) \geq x_{n+1}$,
completing the induction step.

Since every bounded monotonic sequence has a limit (see Taylor [1959],
p. 453) it follows that $\lim_{n \to \infty} x_n$ exists and is $\leq s$. Denoting this limit by
q and letting $n \to \infty$ in the relation

$$x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)}$$

it follows by the continuity of F and $F'$ that

$$q = q - \frac{F(q)}{F'(q)}$$

and hence that F(q) = 0, implying that q = s.
Consider now the case where $s < x_0 \leq b$. Using once more the mean
value theorem, we have $F(x_0) = F'(x_0^*)(x_0 - s)$, where $s < x_0^* < x_0$, and
hence, since $F'$ is decreasing, $F(x_0) \geq (x_0 - s)F'(x_0)$. It follows that

$$x_1 = x_0 - \frac{F(x_0)}{F'(x_0)} \leq x_0 - (x_0 - s) = s.$$

On the other hand $F(x_0) = F(b) - (b - x_0)F'(x_0')$ where $x_0 \leq x_0' \leq b$,
hence

$$F(x_0) \leq F(b) - (b - x_0)F'(b).$$

By virtue of condition (iv) of the theorem we thus have

$$x_1 = x_0 - \frac{F(x_0)}{F'(x_0)} \geq x_0 - \frac{F(x_0)}{F'(b)} \geq x_0 - \frac{F(b)}{F'(b)} + (b - x_0) \geq x_0 - (b - a) + (b - x_0) = a.$$

Hence $a \leq x_1 \leq s$, and it follows by what has been proved above for the
case $a \leq x_0 \leq s$ that the sequence $\{x_n\}$ converges to s. Thus the proof of
theorem 4.8 is complete.

The literature on Newton’s method is extensive. Convergence can be 
proved under various sets of conditions other than those of theorem 4.8, 
and it is also possible to give bounds for the error after a finite number of 
steps (see Ostrowski [1960]). The theorem given above, however, covers
some of the most important special cases.
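As an illustration of theorem 4.8, the following minimal Python sketch checks condition (iv) numerically and then runs Newton’s method from several starting values in [a, b]. The function $F(x) = x - e^{-x}$ on [0, 1] is the one named in problem 22; the helper names `newton` and `tangent_condition`, the tolerance, and the iteration cap are assumptions made for this sketch and are not part of the text.

```python
import math

def newton(F, dF, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - F(x_n)/F'(x_n) until two successive values agree to tol."""
    x = x0
    for _ in range(max_iter):
        x_new = x - F(x) / dF(x)
        if abs(x_new - x) <= tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter steps")

def tangent_condition(F, dF, a, b):
    """Condition (iv): at the endpoint c with the smaller |F'|, |F(c)/F'(c)| <= b - a."""
    c = a if abs(dF(a)) < abs(dF(b)) else b
    return abs(F(c) / dF(c)) <= b - a

# F(x) = x - exp(-x): F(0) < 0 < F(1), F' > 0, F'' < 0, i.e. case (a) with c = b.
F = lambda x: x - math.exp(-x)
dF = lambda x: 1.0 + math.exp(-x)

condition_iv_holds = tangent_condition(F, dF, 0.0, 1.0)
roots = [newton(F, dF, x0) for x0 in (0.0, 0.5, 1.0)]
```

Conditions (i) to (iii) are verified by hand in the comment; only condition (iv) is tested numerically here. Under these assumptions every starting value in [0, 1] should lead to the same limit.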


4.9 Some Special Cases of Newton’s Method 


We now shall give some examples for the application of theorem 4.8.

(i) Determination of square roots. Let c be a given number, c > 0,
and let

$$F(x) = x^2 - c \qquad (x > 0).$$

We wish to solve the equation F(x) = 0, i.e., to compute $x = \sqrt{c}$.
Newton’s method takes the form

$$x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)} = x_n - \frac{x_n^2 - c}{2x_n}$$

or

(4-20) $x_{n+1} = \frac{1}{2}\left(x_n + \frac{c}{x_n}\right).$

We are thus led to the algorithm considered in example 4. Since
$F'(x) > 0$, $F''(x) > 0$ for x > 0, we are in case (c) of theorem 4.8. In any
interval [a, b] with $0 < a < \sqrt{c} < b$ the smallest value of the slope occurs
at x = a, and it is easily seen that condition (iv) is satisfied for every
$b \geq \frac{1}{2}(a + c/a)$. Thus it follows that the sequence defined by (4-20)
converges to $\sqrt{c}$ for every choice of $x_0 > 0$.

Formula (4-20) states that the new approximation is always the arith- 
metic mean of the old approximation and of the result of dividing c by 
the old approximation. For work on a desk computer it is not necessary 
to work with full accuracy from the beginning since every new value can be 
regarded as a new starting value of the iteration.


EXAMPLE

5. c = 10, $x_0 = 3$ yields

$x_0$ = 3             $c/x_0$ = 3.3
$x_1$ = 3.15          $c/x_1$ = 3.1746
$x_2$ = 3.1622        $c/x_2$ = 3.16225532
$x_3$ = 3.16227766    $c/x_3$ = 3.16227766

Since $c/x_3 = x_3$ to the number of digits given, the last value is accepted as
final.
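The iteration (4-20) is short enough to be tried directly. The sketch below is an assumption-laden illustration rather than the desk-computer procedure of example 5: it works in full double precision from the start, so its intermediate values differ from the rounded ones in the table. The name `newton_sqrt` is hypothetical.

```python
def newton_sqrt(c, x0, steps):
    """Apply (4-20), x_{n+1} = (x_n + c/x_n)/2, and return the list of all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + c / x))
    return xs

# c = 10, x_0 = 3 as in example 5, carried out with full accuracy at every step.
iterates = newton_sqrt(10.0, 3.0, 4)
```

From $x_1$ onward the iterates approach $\sqrt{c}$ from above, in agreement with the monotonicity established in the proof of theorem 4.8.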

(ii) Finding roots of arbitrary order. If $F(x) = x^k - c$, where c > 0
and k is any positive integer, Newton’s method yields the formula

$$x_{n+1} = x_n - \frac{x_n^k - c}{k x_n^{k-1}}$$

or

(4-21) $x_{n+1} = \left(1 - \frac{1}{k}\right)x_n + \frac{1}{k}\,\frac{c}{x_n^{k-1}}.$

Again the conditions of theorem 4.8 are satisfied for every interval
[a, b] if $0 < a < \sqrt[k]{c}$ and b is sufficiently large, and the sequence defined
by (4-21) converges to $\sqrt[k]{c}$ for arbitrary $x_0 > 0$.
(iii) Finding reciprocals without division. For a given c > 0 we wish to
determine the number

$$s = \frac{1}{c}.$$

This may be regarded as the solution of the equation

$$F(x) = \frac{1}{x} - c = 0.$$

Newton’s method yields

$$x_{n+1} = x_n - \frac{\dfrac{1}{x_n} - c}{-\dfrac{1}{x_n^2}} = x_n + (1 - cx_n)x_n$$

or

(4-22) $x_{n+1} = x_n(2 - cx_n).$

No divisions are necessary to compute the sequence $\{x_n\}$. Since

$$F'(x) = -\frac{1}{x^2} < 0, \qquad F''(x) = \frac{2}{x^3} > 0$$

for x > 0, we are in case (b) of theorem 4.8, and convergence of the
algorithm is assured if we can find an interval [a, b] so that $a < c^{-1} < b$
and

$$\frac{F(b)}{F'(b)} = b(bc - 1) \leq b - a.$$

The last inequality is satisfied if

$$b \leq \frac{1 + \sqrt{1 - ac}}{c}.$$

Since a > 0 may be made arbitrarily small, this means that the sequence
defined by (4-22) converges to $c^{-1}$ for any choice of $x_0$ such that
$0 < x_0 < 2c^{-1}$.
EXAMPLE

6. To calculate $e^{-1}$, where e = 2.7182818. Starting with $x_0 = 0.3$, we
find

$x_0$ = 0.3           $2 - ex_0$ = 1.1846
$x_1$ = 0.355         $2 - ex_1$ = 1.0350106
$x_2$ = 0.367429      $2 - ex_2$ = 1.0012244
$x_3$ = 0.36787889    $2 - ex_3$ = 1.00000150
$x_4$ = 0.36787994    $2 - ex_4$ = 1.00000000.

The quadratic nature of the convergence is quite evident in this example
(doubling of the number of zeros in the second column at each step).
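A small sketch of the division-free iteration (4-22) follows. It is only an illustration under stated assumptions: the function name `reciprocal` is hypothetical, e is taken to full double precision, and the second call deliberately starts outside the interval $0 < x_0 < 2c^{-1}$ to show that the restriction on $x_0$ matters.

```python
def reciprocal(c, x0, steps):
    """Apply (4-22), x_{n+1} = x_n (2 - c x_n); only multiplication and subtraction occur."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - c * x)
    return x

E = 2.718281828459045

approx = reciprocal(E, 0.3, 6)      # 0 < x_0 < 2/e: converges to 1/e
diverging = reciprocal(E, 0.8, 6)   # x_0 > 2/e: the iterates run away from 1/e
```

The behaviour can be read off from the identity $1 - cx_{n+1} = (1 - cx_n)^2$, which follows from (4-22): the quantity $1 - cx_n$ is squared at every step, so it tends to zero exactly when $|1 - cx_0| < 1$.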


Problems 


22. Show that the function $F(x) = x - e^{-x}$ satisfies the conditions of theorem
4.8 in the interval [0, 1]. Hence solve the equation by Newton’s method,
starting with $x_0 = 0.5$.

23. Calculate $\sqrt{7}$ to nine decimal places.

24. Determine a numerical value of 4 without division, beginning a Newton 
iteration with x») = 0.3. 

25. Show that the numbers defined by (4-20) satisfy

$$\frac{x_n - \sqrt{c}}{x_n + \sqrt{c}} = \left(\frac{x_0 - \sqrt{c}}{x_0 + \sqrt{c}}\right)^{2^n}, \qquad n = 0, 1, 2, \ldots.$$

Hence verify that the sequence $\{x_n\}$ converges to $\sqrt{c}$ quadratically for
arbitrary choices of $x_0 > 0$.

26. If (4-22) is started with $x_0 = 1$, show that

$$x_n = \frac{1 - (1 - c)^{2^n}}{c}.$$

Deduce that the sequence thus generated converges for 0 < c < 2, and
that the convergence is again quadratic.
27*. How do we have to choose the function A in the formula 


ey eG, FO) 
I(x) =x F’(x) + noo Ae) 





such that the iteration defined by f converges cubically to a solution of 
F(x) = 0? 


4.10 Newton’s Method Applied to Polynomials 


Newton’s method is especially well suited to the problem of determining
the zeros of a polynomial in view of the simple algorithm that is
available for calculating both the value of a polynomial and that of its
first derivative. Let

$$p(x) = a_0x^N + a_1x^{N-1} + \cdots + a_N$$

be a given polynomial of degree N. If z is any (real or complex) number,
and if the constants $b_0, b_1, \ldots, b_N$ and $c_0, c_1, \ldots, c_{N-1}$ are determined
from the recurrence relations† of algorithm 3.6,

(4-23) $b_0 = a_0, \quad b_n = zb_{n-1} + a_n \quad (n = 1, \ldots, N)$
(4-24) $c_0 = b_0, \quad c_n = zc_{n-1} + b_n \quad (n = 1, \ldots, N - 1)$

then we have by a special case of theorem 3.6

$$b_N = p(z), \qquad c_{N-1} = p'(z).$$

The quantities p(z) and $p'(z)$ required in each step of Newton’s method can
thus be calculated very easily. The computation proceeds as indicated in
scheme 4.10.

† We now write $c_n$ in place of $x_n$ to avoid confusion with the variable x.


a_0      a_1      a_2     ...    a_{N-1}      a_N
 ↓        ↓        ↓               ↓           ↓
b_0  ⇒   b_1  ⇒   b_2  ⇒  ...  ⇒  b_{N-1}  ⇒  b_N
 ↓        ↓        ↓               ↓
c_0  ⇒   c_1  ⇒   c_2  ⇒  ...  ⇒  c_{N-1}

Scheme 4.10

↓ indicates addition, ⇒ multiplication by z and addition.


EXAMPLE

7. To determine a zero of the polynomial

$$p(x) = x^3 - x^2 + 2x + 5$$

near x = −1. (In table 4.10, the coefficients $a_n$, $b_n$, $c_n$ are arranged in
columns rather than in rows.)

Table 4.10

                     a_n    b_n           c_n            p/p'
x_0 = -1              1      1             1
                     -1     -2            -3
                      2      4             7              0.142857
                      5      1

x_1 = -1.142857       1      1             1
                     -1     -2.142857     -3.285714
                      2      4.448979                    -0.010305
                      5     -0.084546

x_2 = -1.129807       1      1             1
                     -1     -2.129807     -3.259614
                      2      4.406241      8.089006       0.002691
                      5      0.021764

x_3 = -1.132498       1      1             previous
                     -1     -2.132498      value of p'
                      2      4.415050      retained!     -0.000004
                      5     -0.000035

x_4 = -1.132494       1      1
                     -1     -2.132494
                      2      4.415037
                      5     -0.000003
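The recurrences (4-23) and (4-24) translate almost line for line into code. The following is a minimal sketch, assuming a polynomial of degree at least 1 given by its coefficient list in descending powers; the names `horner_with_derivative` and `newton_poly`, the tolerance, and the iteration cap are assumptions for illustration. Unlike table 4.10 it recomputes $p'$ at every step and works in double precision, so its intermediate values need not match the table.

```python
def horner_with_derivative(a, z):
    """Return (p(z), p'(z)) for p(x) = a[0] x^N + ... + a[N], N >= 1, via (4-23), (4-24)."""
    b = a[0]                    # b_0 = a_0
    c = b                       # c_0 = b_0
    for coeff in a[1:-1]:
        b = z * b + coeff       # b_n = z b_{n-1} + a_n
        c = z * c + b           # c_n = z c_{n-1} + b_n
    b = z * b + a[-1]           # b_N = p(z); c stops at c_{N-1} = p'(z)
    return b, c

def newton_poly(a, x0, tol=1e-12, max_iter=50):
    """Newton's method for a polynomial, evaluating p and p' by the double recurrence."""
    x = x0
    for _ in range(max_iter):
        p, dp = horner_with_derivative(a, x)
        step = p / dp
        x -= step
        if abs(step) <= tol:
            return x
    raise RuntimeError("no convergence within max_iter steps")

coeffs = [1.0, -1.0, 2.0, 5.0]          # p(x) = x^3 - x^2 + 2x + 5 of example 7
p0, dp0 = horner_with_derivative(coeffs, -1.0)
root = newton_poly(coeffs, -1.0)
```

At z = −1 the recurrences reproduce the first block of table 4.10 ($b_3 = 1$, $c_2 = 7$), and the computed zero agrees with the value $x_4$ of the table to the digits shown there.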
If z is a zero of the polynomial p, then, according to theorem 2.5b, p can
be represented in the form

(4-25) $p(x) = (x - z)q(x),$

where

$$q(x) = b_0x^{N-1} + b_1x^{N-2} + \cdots + b_{N-1}$$

is a certain polynomial of degree N − 1. Multiplying the two polynomials
on the right of equation (4-25) and comparing coefficients of like
powers of x, we find

$$a_0 = b_0,$$
$$a_{n+1} = -zb_n + b_{n+1}, \qquad n = 0, 1, \ldots, N - 1,$$

where we have put $b_N = 0$ (in agreement with $b_N = p(z) = 0$).
Solving for $b_{n+1}$ and replacing n + 1 by n, we find that the second relation
is identical with (4-23). Thus we have:

Theorem 4.10 If z is a zero of the polynomial p, the coefficients
$b_0, b_1, \ldots, b_{N-1}$ defined by equation (4-23) are identical with the
coefficients of the polynomial $(x - z)^{-1}p(x)$.


EXAMPLE

8. Accepting $x_4 = -1.132494$ as a zero of the polynomial p considered
in example 7, we have

$$q(x) = \frac{p(x)}{x - x_4} = x^2 - 2.132494x + 4.415037.$$

Thus, having found a zero of a polynomial p of degree N, the polynomial
q whose zeros are identical with those of p save the one which has
already been determined is easily constructed. The remaining zeros of p
can now be found more easily as the zeros of q. The process of passing
from p to q is known as deflating p. By successive deflations, the zeros of
a given polynomial thus can be found by working with polynomials of
successively lower degrees.

Unfortunately, the above process is subject to accumulation of round-off
errors in view of the fact that the data of the problem (in this case, the
coefficients of the given polynomial) enter the computation only at the
first step. It is thus advisable to recheck each zero found by using it once
more as a starting value for Newton’s process applied to the full
(undeflated) polynomial p.
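Deflation according to theorem 4.10 needs only the recurrence (4-23). The sketch below is a hypothetical helper (`deflate` is not a name used in the text) that returns the coefficients of q together with the remainder $b_N$, which would vanish for an exact zero and is small for an approximate one.

```python
def deflate(a, z):
    """Divide p(x) = a[0] x^N + ... + a[N] by (x - z) using (4-23).

    Returns (q, r): q holds b_0, ..., b_{N-1}, and r = b_N = p(z) is the remainder.
    """
    b = [a[0]]
    for coeff in a[1:]:
        b.append(z * b[-1] + coeff)
    return b[:-1], b[-1]

# Polynomial of example 7, deflated by the approximate zero accepted in example 8.
q, remainder = deflate([1.0, -1.0, 2.0, 5.0], -1.132494)
```

The coefficients obtained in this way agree with the polynomial q(x) of example 8 to the digits printed there. Inspecting the remainder is a cheap way to see how accurate the accepted zero is before working with the deflated polynomial.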

We mention without proof that local convergence of Newton’s method 
can also be established for complex zeros and for polynomials with complex 
coefficients. However, for the determination of pairs of conjugate 
complex zeros of polynomials with real coefficients a more efficient method 
is available (see §5.6). 
Problems

28. Determine the unique positive zero of the polynomial

$$p(x) = x^3 - x^2 - x - 1$$

by Newton’s method.

29. Find the zero near 0.9 of the polynomial

$$p(x) = 4x^6 - 5x^5 + 4x^4 - 3x^3 + 7x^2 - 7x + 1.$$

30. Prove the relation $c_{N-1} = p'(z)$ by differentiating the recurrence relation
(4-23) and observing that the $b_n$ are functions of z.


4.11 Some Modifications of Newton’s Method 


Newton’s method requires the evaluation of the derivative F’ of the 
given function F. While in most textbook problems this requirement is 
trivial, this may not be the case in more involved situations, for instance if 
the function F itself is the result of a complicated computation. It 
therefore seems worthwhile to discuss some methods for solving F(x) = 0 
that do not require the evaluation of F’ and nevertheless retain some of the 
favorable convergence properties of Newton’s method. 

(i) Whittaker’s method (Whittaker and Robinson [1928]). The simplest
way to avoid the computation of $F'$ is to replace $F'(x_n)$ in (4-19) by a
constant value, say m. The resulting formula

(4-26) $x_{n+1} = x_n - \frac{F(x_n)}{m}$

then defines, for a certain range of values of m, a linearly converging
procedure, unless we happen to pick $m = F'(s)$. If the estimate of m is
good, convergence may nevertheless be quite rapid. Especially in the
final stages of Newton’s process it is usually not necessary to recompute $F'$
at each step.

(ii) Regula falsi. Here the value of the derivative $F'(x_n)$ is approximated
by the difference quotient

$$\frac{F(x_n) - F(x_{n-1})}{x_n - x_{n-1}}$$

formed with the two preceding approximations. There results the
formula

(4-27) $x_{n+1} = x_n - \frac{x_n - x_{n-1}}{F(x_n) - F(x_{n-1})}\,F(x_n).$

This is identical with

$$x_{n+1} = \frac{x_{n-1}F(x_n) - x_nF(x_{n-1})}{F(x_n) - F(x_{n-1})},$$

but for numerical purposes the “incremental” form (4-27) is preferable.

The algorithm suggested by (4-27) is known as the regula falsi. It is 
defined by a difference equation of order 2 and thus is not covered by the 
general theory given at the beginning of this chapter. A more detailed 
investigation (see Ostrowski [1960], p. 17) shows that the degree of con- 
vergence of the regula falsi lies somewhere between that of Newton’s 
method and of ordinary iteration. 
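The incremental form (4-27) can be sketched as follows. This is an illustrative assumption, not the author’s program: the name `regula_falsi`, the stopping tolerance, and the guard against a vanishing difference quotient are choices made here. (In later literature an iteration of this form, which always uses the two most recent approximations, is usually called the secant method.) The test equation $x^2 = 1$ is the one of problem 31.

```python
def regula_falsi(F, x0, x1, tol=1e-12, max_iter=100):
    """Iterate the incremental form (4-27), starting from the two values x0 and x1."""
    f0, f1 = F(x0), F(x1)
    for _ in range(max_iter):
        if f1 == f0:                     # difference quotient undefined; stop here
            return x1
        x2 = x1 - (x1 - x0) / (f1 - f0) * f1
        if abs(x2 - x1) <= tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, F(x2)
    raise RuntimeError("no convergence within max_iter steps")

root = regula_falsi(lambda x: x * x - 1.0, 2.0, 1.5)
```

Only one new evaluation of F is needed per step, since the value $F(x_{n-1})$ is carried over from the preceding step.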

(iii) Muller’s method (Muller [1956]). The regula falsi can be obtained
by approximating the graph of the function F by the straight line passing
through the points $(x_{n-1}, F(x_{n-1}))$ and $(x_n, F(x_n))$. The point of intersection
of this line with the x-axis defines the new approximation $x_{n+1}$
(see Fig. 4.11).

Instead of approximating F by a linear function, it seems natural to try
to obtain more rapid convergence by approximating F by a polynomial p
of degree k > 1 coinciding with F at the points $x_n, x_{n-1}, \ldots, x_{n-k}$, and
to determine $x_{n+1}$ as one of the zeros of p. Muller has made a detailed
study of the case k = 2 and found that this choice of k yields very satisfactory
results. Since the construction of p depends on the theory of the
interpolating polynomial, we postpone the derivation of Muller’s algorithm
to chapter 10.

Figure 4.11

(iv) Newton’s method in the case $F'(s) = 0$. Newton’s algorithm was
derived in §4.7 under the assumption that $F'(x) \neq 0$, implying in particular
that $F'(s) \neq 0$. Let us now consider the general situation where

$$F(s) = F'(s) = \cdots = F^{(m-1)}(s) = 0, \qquad F^{(m)}(s) \neq 0,$$

where $m \geq 1$. If we set x = s + h, the iteration function for Newton’s
process,

$$f(x) = x - \frac{F(x)}{F'(x)},$$

can be expanded in powers of h to give

$$f(s + h) = s + h - \frac{(m!)^{-1}F^{(m)}(s)h^m + O(h^{m+1})}{[(m-1)!]^{-1}F^{(m)}(s)h^{m-1} + O(h^m)} = s + \left(1 - \frac{1}{m}\right)h + O(h^2).$$

From this we find

$$f'(s) = \lim_{h \to 0} \frac{f(s + h) - f(s)}{h} = 1 - \frac{1}{m}.$$

Thus, if $m \neq 1$, $f'(s) \neq 0$, and the convergence fails to be quadratic.
However, the above analysis shows how to modify the iteration function
in order to achieve quadratic convergence. If we set

$$f(x) = \begin{cases} x - \dfrac{mF(x)}{F'(x)}, & x \neq s, \\ s, & x = s, \end{cases}$$

then a computation similar to the one performed above shows that
$f'(s) = 0$. By theorem 4.6, the sequence defined by

(4-28) $x_{n+1} = x_n - \frac{mF(x_n)}{F'(x_n)}, \qquad n = 0, 1, 2, \ldots$

converges to s quadratically, provided that $x_0$ is sufficiently close to s.
For m = 1 this algorithm reduces to the ordinary Newton process.

Admittedly the above discussion is somewhat academic, because only
rarely in practice do we have an a priori knowledge of the fact that $F'(x) = 0$
at a solution of F(x) = 0. However, formula (4-28) has also been used
in the early stages of Newton’s process if two solutions of F(x) = 0 are
very close together. In this case m was chosen in a heuristic fashion
somewhere between 1 and 2 (see Forsythe [1958], p. 234).
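The effect of the factor m in (4-28) can be observed on a function with a double zero. The example function $F(x) = (x - 1)^2(x + 1)$, the name `newton_multiple`, and the starting value 2 are assumptions chosen for this sketch; they do not come from the text.

```python
def newton_multiple(F, dF, m, x0, steps):
    """Apply (4-28), x_{n+1} = x_n - m F(x_n)/F'(x_n), and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        if F(x) == 0.0:          # already at a solution, where F' may vanish too
            break
        xs.append(x - m * F(x) / dF(x))
    return xs

# Hypothetical test function with a double zero (m = 2) at x = 1.
F = lambda x: (x - 1.0) ** 2 * (x + 1.0)
dF = lambda x: 2.0 * (x - 1.0) * (x + 1.0) + (x - 1.0) ** 2

plain = newton_multiple(F, dF, 1, 2.0, 8)      # ordinary Newton: error roughly halves
modified = newton_multiple(F, dF, 2, 2.0, 8)   # (4-28) with m = 2: quadratic convergence
```

With m = 1 the error is multiplied by about $1 - 1/m = 1/2$ at each step, as the expansion of $f(s + h)$ predicts, whereas with m = 2 a handful of steps suffice.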


Problems 


31. Show that if the regula falsi is applied to the equation $x^2 = 1$, then,
assuming that the errors $d_n = x_n - 1$ tend to zero, the stronger relation

$$\lim_{n \to \infty} \frac{d_{n+1}}{d_n d_{n-1}} = \frac{1}{2}$$

holds.

32. Give a closed formula for $x_n$ if the equation $x^2 = 0$ is solved (a) by the
ordinary Newton’s method, (b) by the modification of Newton’s method
given by (4-28), and discuss the result. (Assume $x_0 = 1$ in both cases.)

33. Show that if the equation $[F(x)]^m = 0$ is solved by (4-28), there results the
ordinary Newton’s method for solving F(x) = 0.


4.12 The Diagonal Aitken Procedure 


We continue to discuss the problem of achieving quadratic (i.e., 
Newton-like) convergence by methods not requiring the evaluation of any 
derivatives. We recall that none of the substitutes for Newton’s method 
offered in the preceding section quite achieved quadratic convergence. 
Returning to equations written in the form 


(4-29) x = f(x) 


we now shall show that true quadratic convergence can be achieved 
without derivative evaluation by a modification of the Aitken acceleration 
procedure discussed in §4.4. 

This modification proceeds as follows. We start out, as we do in
ordinary iteration, by choosing an initial value $x_0$ and calculating
$x_1 = f(x_0)$, $x_2 = f(x_1)$. Aitken’s formula (4-11) is now applied to $x_0$, $x_1$, $x_2$,
yielding

(4-30) $x_0' = x_0 - \frac{(x_1 - x_0)^2}{x_2 - 2x_1 + x_0}.$

The number $x_0' = x_0^{(1)}$ is used as a new starting value for two more iterations.
Having calculated $x_1^{(1)} = f(x_0^{(1)})$, $x_2^{(1)} = f(x_1^{(1)})$, we apply Aitken’s
formula again, obtaining an accelerated value $x_0^{(2)}$, which in turn is used
to start a new iteration, etc. If a denominator in Aitken’s formula
happens to be zero, we set $x_0^{(k+1)} = x_0^{(k)}$, thus in effect terminating the
iteration. Schematically, the algorithm is described by the following
table:

$x_0^{(0)}$     $x_0^{(1)}$     $x_0^{(2)}$     ...
$x_1^{(0)}$     $x_1^{(1)}$     $x_1^{(2)}$
$x_2^{(0)}$     $x_2^{(1)}$     $x_2^{(2)}$

Scheme 4.12


Since $x_1 = f(x_0)$, $x_2 = f(x_1) = f(f(x_0))$, the values $x^{(k+1)}$ can be
thought of as being generated from $x^{(k)}$ in a single iteration step by means
of a function F defined in terms of f as follows. We set

$$N(x) = f(f(x)) - 2f(x) + x$$

and put

(4-31) $F(x) = \begin{cases} x - \dfrac{[f(x) - x]^2}{N(x)}, & N(x) \neq 0; \\ x, & N(x) = 0. \end{cases}$

Thus we are led to the following formal statement of the new procedure:

Algorithm 4.12 Choose $x^{(0)}$ and determine the sequence $\{x^{(k)}\}$
recursively by

(4-32) $x^{(k+1)} = F(x^{(k)}), \qquad k = 0, 1, 2, \ldots,$

where F is defined by equation (4-31).


This algorithm is sometimes known as Steffensen’s iteration, as it was 
first proposed by Steffensen [1933]. 
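Algorithm 4.12 can be sketched in a few lines. The function names `steffensen_step` and `steffensen` are assumptions for illustration; the test equation $x = e^{-x}$ with starting value 0.5 is the one treated in example 9 of this section.

```python
import math

def steffensen_step(f, x):
    """One application of the function F of (4-31) to x."""
    fx = f(x)
    ffx = f(fx)
    denom = ffx - 2.0 * fx + x          # N(x) = f(f(x)) - 2 f(x) + x
    if denom == 0.0:                    # second branch of (4-31): F(x) = x
        return x
    return x - (fx - x) ** 2 / denom

def steffensen(f, x0, steps):
    """Return x^(0), x^(1), ..., x^(steps) generated by (4-32)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(steffensen_step(f, xs[-1]))
    return xs

top_row = steffensen(lambda x: math.exp(-x), 0.5, 3)
```

Each step costs two evaluations of f and none of its derivative. The list `top_row` corresponds to the top row of scheme 4.12.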

The questions arise whether Steffensen’s iteration is well defined, whether
the sequence $\{x^{(k)}\}$ converges to a solution of x = f(x), and whether the
degree of convergence is higher than linear. We shall give affirmative
answers under the following hypotheses: The equation x = f(x) has a
solution x = s, and the function f is three times continuously differentiable
in a neighborhood of x = s and satisfies $f'(s) \neq 1$. It will be shown that
these hypotheses imply that F satisfies F(s) = s, is twice continuously
differentiable in a neighborhood of x = s, and satisfies $F'(s) = 0$.
Quadratic convergence of the sequence $\{x^{(k)}\}$ for all $x^{(0)}$ sufficiently close
to s then follows as a consequence of theorem 4.6.

As direct differentiation of F turns out to be cumbersome, we introduce
an auxiliary function g by setting

$$g(h) = \begin{cases} \dfrac{f(s + h) - s}{h}, & h \neq 0; \\ f'(s), & h = 0. \end{cases}$$

The function g is still at least twice continuously differentiable near h = 0.
The definition of g implies

$$f(s + h) = s + hg(h),$$
$$f(f(s + h)) = f(s + hg(h)) = s + hg(h)g(hg(h)).$$

If x = s + h, it follows that

$$N(x) = f(f(x)) - 2f(x) + x = hG(h),$$

where

$$G(h) = 1 - 2g(h) + g(h)g(hg(h)).$$

Like g, the function G is twice continuously differentiable near h = 0;
furthermore, since $g(0) = f'(s) \neq 1$,

(4-33) $G(0) = [g(0) - 1]^2 \neq 0,$

showing that $G(h) \neq 0$ for |h| sufficiently small. Hence $N(x) \neq 0$ for
$x \neq s$, |x − s| sufficiently small, and by (4-31),

$$F(x) = s + h - h\,\frac{[g(h) - 1]^2}{G(h)}.$$

This representation of F also holds for h = 0; it shows that F, too, is
twice continuously differentiable, and that F(s) = s. Furthermore,

$$F'(s) = \lim_{h \to 0} \frac{F(s + h) - s}{h} = \lim_{h \to 0} \left\{1 - \frac{[g(h) - 1]^2}{G(h)}\right\} = 1 - \frac{[g(0) - 1]^2}{G(0)} = 0$$

by virtue of (4-33).


EXAMPLE

9. We apply Steffensen iteration to the equation $x = e^{-x}$ considered in
example 1. Starting with $x^{(0)} = 0.5$, we obtain the following values
(arranged in the manner of scheme 4.12)

0.500000    0.567624    0.567144    0.567143
0.606531    0.566871    0.567143
0.545239    0.567298    0.567143.

The values in the top row are seen to converge very rapidly.


Problems 


34. Apply Steffensen iteration to the solution of Kepler’s equation given in
problem 5, where m = 1, E = 0.8.

35. Apply algorithm 4.12 to the equation

$$x = \frac{10}{x}$$

starting with $x^{(0)} = 3$. Compare the sequence $\{x^{(k)}\}$ with the sequence
obtained by Newton’s method for computing $\sqrt{10}$.

36. Give a somewhat simpler proof of the quadratic convergence of Steffensen’s
iteration by assuming that f can be expanded in powers of h = x − s.

37. Show that if, in addition to the hypotheses stated above, $f'(s) = 0$, then
also $F''(s) = 0$. Thus in this case Steffensen’s iteration converges at least
cubically (see Householder [1953], p. 128).


4.13 A Non-local Convergence Theorem for Steffensen’s Method†


As was the case with the corresponding result concerning Newton’s
method, the convergence statement concerning Steffensen’s iteration
proved in the last section is unsatisfactory because it guarantees convergence
only for choices of $x^{(0)}$ “sufficiently close” to the solution s. We
now shall prove a result which guarantees convergence no matter where the
iteration is started. The extra hypothesis which will be added concerns
the signs of $f'$ and of $f''$.


Theorem 4.13 Let I denote the semi-infinite interval $(a, \infty)$, and let the
function f satisfy the following conditions:

(i) f is defined and twice continuously differentiable on I;
(ii) $f(x) > a$, $x \in I$;
(iii) $f'(x) < 0$, $x \in I$;
(iv) $f''(x) > 0$, $x \in I$.

Then algorithm 4.12 defines a sequence which converges to the
(unique) solution s of x = f(x) for any choice of the starting value
$x^{(0)}$ in I.


Proof. By virtue of the hypotheses, the graph of the function f looks as
indicated in figure 4.13 (where a = 0); it shows that the equation x = f(x)
has a unique solution s.

Figure 4.13

† This section may be omitted at first reading.

Let x > s. Then, by virtue of (iii), f(x) < s, and f(f(x)) > s. Hence,
by application of the mean value theorem,

$$f(x) - s = f(x) - f(s) = f'(t_1)(x - s),$$

where $s < t_1 < x$. Furthermore,

$$f(f(x)) - s = f(f(x)) - f(s) = f'(t_2)(f(x) - s),$$

where $f(x) < t_2 < s$. We thus find

$$f(x) - x = f(x) - s - (x - s) = [f'(t_1) - 1](x - s)$$

and

$$f(f(x)) - f(x) = f(f(x)) - s - [f(x) - s] = [f'(t_2) - 1][f(x) - s] = [f'(t_2) - 1]f'(t_1)(x - s).$$

Thus the expression N in equation (4-31) is given by

$$f(f(x)) - 2f(x) + x = f(f(x)) - f(x) - (f(x) - x) = \{[f'(t_2) - 1]f'(t_1) - [f'(t_1) - 1]\}(x - s).$$

The expression inside the braces may be written

$$[f'(t_1) - 1]^2 + [f'(t_2) - f'(t_1)]f'(t_1).$$

This is positive, since by (iv), $f'(t_2) < f'(t_1)$ and hence by (iii),

$$[f'(t_2) - f'(t_1)]f'(t_1) > 0.$$

Hence $N(x) \neq 0$, and the first definition of F applies for all x > s. Since
N is a continuous function of x, it follows that F is continuous at each
point x > s. (Continuity at x = s has already been established in the
preceding section under wider assumptions.)

Using the above work, we find for x > s

$$F(x) - s = (1 - Q)(x - s),$$

where

(4-34) $Q = \frac{[f'(t_1) - 1]^2}{[f'(t_1) - 1]^2 + [f'(t_2) - f'(t_1)]f'(t_1)}.$

By the above, 0 < Q < 1, and it follows that

$$0 < F(x) - s < x - s$$

or

$$s < F(x) < x.$$

Thus for $x^{(0)} > s$, the sequence $\{x^{(k)}\}$ is decreasing and bounded below by
s. It therefore must have a limit $q \geq s$. By the continuity of F, F(q) = q.
This is possible only for q = s. Thus the relation

$$\lim_{k \to \infty} x^{(k)} = s$$

has been proved for the case $x^{(0)} > s$.

If $x^{(0)} = s$, then all $x^{(k)} = s$, and the result is trivial. Thus let now
$x^{(0)} < s$. In that situation, f(x) > s, f(f(x)) < s. The above computations
remain valid, with the difference that now

$$x < t_1 < s, \qquad s < t_2 < f(x).$$

We find again $N(x) \neq 0$; however, since the order of $t_1$ and $t_2$ is now
reversed, the numerator in (4-34) now exceeds the denominator, and we
have Q > 1. It follows that F(x) > s. Thus if $x^{(0)} < s$, then
$x^{(1)} = F(x^{(0)}) > s$, and the sequence $\{x^{(k)}\}$ decreases from k = 1 onward.
Convergence to s now follows as in the case $x^{(0)} > s$.


EXAMPLES

10. The hypotheses of theorem 4.13 are satisfied for $f(x) = e^{-x}$, x > 0.
Steffensen iteration thus will produce the solution of $x = e^{-x}$ for any
choice of the starting value $x^{(0)} > 0$.

11. Let c > 0, and put f(x) = c/x for x > 0. Again this function
satisfies the hypotheses of theorem 4.13, and the Steffensen iteration
converges to the solution $s = \sqrt{c}$ for any choice of $x^{(0)} > 0$. Since
f(f(x)) = x, the function F is given by

$$F(x) = x - \frac{\left(\dfrac{c}{x} - x\right)^2}{x - 2\dfrac{c}{x} + x} = \frac{1}{2}\left(x + \frac{c}{x}\right),$$

and the iteration function is identical with that given by Newton’s method.


Problems 


38. Prove results analogous to theorem 4.13 for the cases

(a) $0 \leq f'(x) < 1$, $f''(x) > 0$;
(b) $f'(x) > 1$, $f''(x) > 0$.

39. Show that the function $f(x) = A + B/x^n$ satisfies the hypotheses of
theorem 4.13 for $x \geq A > 0$, B > 0. Hence find the solution s > 1 of
the equation x? = x? + 1. 




Recommended Reading 


The theory of iteration of functions of one variable is treated very 
thoroughly in the books by Householder [1953] and Ostrowski [1960]. 
Householder also gives numerous references to earlier work. 


Research Problem 


Find bounds (exhibiting the quadratic nature of the convergence) for 
the error of the approximations generated by Steffensen’s iteration. 
(Similar bounds are known for Newton’s method, see for example, 
Ostrowski [1960].) 


chapter 5 iteration for systems of equations 


In this chapter we will show how to apply the basic iteration algorithm 
discussed in chapter 4 and its modifications to obtain solutions of systems 
of equations. The equations envisaged here are nonlinear. Inasmuch as 
linear equations are a special case of nonlinear ones, the algorithms 
discussed here are naturally applicable to linear systems. However, for 
the reasons mentioned in the preface, the many important algorithms 
that are especially designed for the solution of linear systems of equations 
are not treated in this book. 


5.1 Notation 


The algorithms discussed in this chapter are, in principle, applicable to
problems involving any number of equations and unknowns. However,
for greater concreteness, and also in order to avoid cumbersome notation,
we shall consider explicitly only the case of two equations with two
unknowns. These equations will usually be written in the form

(5-1) $x = f(x, y), \qquad y = g(x, y),$

where f and g are certain functions of the point (x, y) that are defined in
suitable regions of the plane. Each of the two equations x − f(x, y) = 0
and y − g(x, y) = 0 defines, in general, a curve in the (x, y) plane. The
problem of solving the system of equations (5-1) is equivalent to the problem
of finding the point or points of intersection of these curves. It will
usually be assumed, and in some cases also proved, that such a point of
intersection exists. Its coordinates will be denoted by s and t. The
quantities s and t then satisfy the relations

$$s = f(s, t), \qquad t = g(s, t).$$
EXAMPLE

1. Let $f(x, y) = x^2 + y^2$, $g(x, y) = x^2 - y^2$. The system (5-1) reads in
this case

$$x = x^2 + y^2,$$
$$y = x^2 - y^2.$$

The equation $x^2 + y^2 - x = 0$ defines a circle centered at $(\frac{1}{2}, 0)$, the
equation $x^2 - y^2 - y = 0$ a hyperbola centered at $(0, -\frac{1}{2})$. Both the
circle and the hyperbola pass through the origin, thus our system has
the obvious solution s = t = 0. But an inspection of the graphs shows
that there must be another solution near x = 0.8, y = 0.4.


It will be convenient to employ vector notation. Thus we not only shall
be able to simplify the writing of our equations, but also to aid the understanding
of the theoretical analysis and even the programming of our
algorithms. We represent the coordinates of the point (x, y) by the
column vector

$$\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}.$$

The functions f and g then become functions of the vector $\mathbf{x}$, whose value
for a particular $\mathbf{x}$ we denote by $f(\mathbf{x})$ and $g(\mathbf{x})$. If we denote by $\mathbf{f}$ the
column vector with components f and g, the system (5-1) can be written
more simply as follows:

(5-2) $\mathbf{x} = \mathbf{f}(\mathbf{x}).$

The fact that the vector

$$\mathbf{s} = \begin{pmatrix} s \\ t \end{pmatrix}$$

is a solution of equation (5-1) is expressed in the form

$$\mathbf{s} = \mathbf{f}(\mathbf{s}).$$
A vector analog of the absolute value of a number (or scalar, as numbers
are called in this context) will be required. Clearly, the length of a vector
is such an analog. If $\mathbf{x} = (x, y)$, we write

(5-3) $\|\mathbf{x}\| = \sqrt{x^2 + y^2}.$

The quantity $\|\mathbf{x}\|$ is called the Euclidean norm of the vector $\mathbf{x}$. It is
nonnegative, and zero only if $\mathbf{x}$ is the zero vector $\mathbf{0} = (0, 0)$. Furthermore,
if the sum of two vectors and the product of a vector and a scalar are
defined in the natural way, we have

(5-4) $\|c\mathbf{x}\| = |c|\,\|\mathbf{x}\|,$

(5-5) $\|\mathbf{x}_1 + \mathbf{x}_2\| \leq \|\mathbf{x}_1\| + \|\mathbf{x}_2\|.$

Relation (5-5) is called the triangle inequality. If $\mathbf{x}_1$ and $\mathbf{x}_2$ are two
vectors, then $\|\mathbf{x}_1 - \mathbf{x}_2\|$ is the distance of the points whose coordinates are
the components of $\mathbf{x}_1$ and $\mathbf{x}_2$.

We shall have occasion to consider sequences of vectors $\{\mathbf{x}_n\}$. Such a
sequence is said to converge to a vector $\mathbf{v}$, if

$$\|\mathbf{x}_n - \mathbf{v}\| \to 0 \quad \text{for } n \to \infty.$$

The following criterion due to Cauchy is necessary and sufficient for the
convergence of the sequence $\{\mathbf{x}_n\}$ to some vector $\mathbf{v}$ (see Buck [1956], p. 13).
Given any number $\epsilon > 0$, there exists an integer N so that for all n > N
and all m > N

$$\|\mathbf{x}_m - \mathbf{x}_n\| < \epsilon.$$


5.2 A Theorem on Contracting Maps 
Iteration in several variables is defined as in the case of one variable. 


Algorithm 5.2 Choose a vector $\mathbf{x}_0$, and calculate the sequence of
vectors $\{\mathbf{x}_n\}$ recursively by

(5-6) $\mathbf{x}_{n+1} = \mathbf{f}(\mathbf{x}_n), \qquad n = 0, 1, 2, \ldots.$
As in the scalar situation, there arise the questions whether the sequence 


{x,} is well defined, whether it converges, and whether its limit necessarily 
is a solution of the equation 


(5-7) x = f(x). 
All these questions are answered by the following result: 


Theorem 5.2 Let R denote the rectangular region $a \leq x \leq b$,
$c \leq y \leq d$, and let the functions f and g satisfy the following
conditions:

(i) f and g are defined and continuous on R;
(ii) For each $\mathbf{x} \in R$, the point $(f(\mathbf{x}), g(\mathbf{x}))$ also lies in R;
(iii) There exists a constant L < 1 such that for any two points $\mathbf{x}_1$
and $\mathbf{x}_2$ in R the following inequality holds:

(5-8) $\|\mathbf{f}(\mathbf{x}_1) - \mathbf{f}(\mathbf{x}_2)\| \leq L\|\mathbf{x}_1 - \mathbf{x}_2\|.$

Then the following statements are true:

(a) Equation (5-7) has precisely one solution $\mathbf{s}$ in R;
(b) for any choice of $\mathbf{x}_0$ in R, the sequence $\{\mathbf{x}_n\}$ given by algorithm 5.2
is defined and converges to $\mathbf{s}$;
(c) for any n = 1, 2, ..., the following inequality holds:

(5-9) $\|\mathbf{x}_n - \mathbf{s}\| \leq \frac{L^n}{1 - L}\|\mathbf{x}_1 - \mathbf{x}_0\|.$
Condition (5-8) is again called a Lipschitz condition. It here expresses
the fact that the mapping $\mathbf{x} \to \mathbf{f}(\mathbf{x})$ diminishes the distance between any
two points in R at least by the factor L. For this reason a mapping with
the property (5-8) is called a contraction mapping.

It should be noted that already statement (a) is not as trivial now as in 
the case of one variable, where the existence of a solution could be inferred 
from an inspection of the graph of the function f. Statement (b) guaran- 
tees convergence of the algorithm, while (c) gives an upper bound for the 
error after n steps. 


Proof of theorem 5.2. The proof will be accomplished in several stages. 
From condition (ii) it is evident that the sequence {x,} is defined, and that 
its elements lie in R. Proceeding exactly as in the proof of (4-6) (§4.3) 
we now can show that (iii) implies 


(5-10) IXn+4 = x,|| = L"||x, = Xoll, c= 0, l, 2; se oe 
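Now let n be a fixed positive integer, and let m > n. With a view towards
applying the Cauchy criterion, we shall find a bound for $\|\mathbf{x}_m - \mathbf{x}_n\|$.
Writing

    $\mathbf{x}_m - \mathbf{x}_n = (\mathbf{x}_{n+1} - \mathbf{x}_n) + (\mathbf{x}_{n+2} - \mathbf{x}_{n+1}) + \cdots + (\mathbf{x}_m - \mathbf{x}_{m-1})$

and applying the triangle inequality (5-5), we obtain

    $\|\mathbf{x}_m - \mathbf{x}_n\| \le \|\mathbf{x}_{n+1} - \mathbf{x}_n\| + \|\mathbf{x}_{n+2} - \mathbf{x}_{n+1}\| + \cdots + \|\mathbf{x}_m - \mathbf{x}_{m-1}\|$

and, using (5-10) to estimate each term on the right,

(5-11)    $\|\mathbf{x}_m - \mathbf{x}_n\| \le (L^n + L^{n+1} + \cdots + L^{m-1})\,\|\mathbf{x}_1 - \mathbf{x}_0\| \le \dfrac{L^n}{1 - L}\,\|\mathbf{x}_1 - \mathbf{x}_0\|,$

by virtue of 0 < L < 1. Since the expression on the right does not depend
on m and tends to zero as $n \to \infty$, we have established that the sequence
$\{\mathbf{x}_n\}$ satisfies the Cauchy criterion. It thus has a limit s. Since R is
compact, $\mathbf{s} \in R$.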

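We next show that s is a solution of (5-7). By virtue of the continuity
of f and g,

(5-12)    $\lim_{n \to \infty} \mathbf{f}(\mathbf{x}_n) = \mathbf{f}(\mathbf{s}),$

and thus

    $\mathbf{s} = \lim_{n \to \infty} \mathbf{x}_n = \lim_{n \to \infty} \mathbf{x}_{n+1} = \lim_{n \to \infty} \mathbf{f}(\mathbf{x}_n) = \mathbf{f}(\mathbf{s}),$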


as desired. 
Uniqueness of the solution s follows from the Lipschitz condition (iii), 
exactly as it does for one variable (see the end of §3.1). Relation (5-9) 
finally follows by letting $m \to \infty$ in (5-11). This completes the proof of
theorem 5.2. No use has been made of the fact that the number of 
equations and unknowns is two; both the theorem and its proof remain 
valid in a Euclidean space of arbitrary dimension, and even in some 
infinite-dimensional spaces. 


Problems 


1. Prove the relation (5-12) by means of (iii), without making use of the
continuity of f and g.

2. Prove that condition (iii) (even when $L \ge 1$) implies that both f and g are
continuous at every point of R.


5.3 A Bound for the Lipschitz Constant†


We consider here the problem of verifying whether a given pair of 
functions (f, g) satisfies a Lipschitz condition of the form (5-8). The 
corresponding condition for a single function of one variable could be very 
easily checked by computing the derivative (see §4.1). If the absolute 
value of the derivative turned out to be bounded by a constant, then the 
Lipschitz condition was satisfied with that constant. A similar criterion is 
available for pairs of functions of two variables. 


Theorem 5.3 Let the functions f and g have continuous partial
derivatives in the region R defined in theorem 5.2. Then the inequality
(5-8) holds with L = J, where

(5-13)    $J = \max_{(x, y) \in R} \sqrt{f_x^2 + f_y^2 + g_x^2 + g_y^2}.$

Proof. Let $(x_0, y_0)$ and $(x_1, y_1)$ be two points in R. We define for
$0 \le t \le 1$

    $x_t = x_0 + th, \qquad y_t = y_0 + tk,$

where

    $h = x_1 - x_0, \qquad k = y_1 - y_0,$

and set

    $f_t = f(x_t, y_t), \qquad g_t = g(x_t, y_t).$


It is to be proved that

(5-14)    $(f_1 - f_0)^2 + (g_1 - g_0)^2 \le (h^2 + k^2)\,J^2.$

For the proof we shall require both the Schwarz inequality for sums,

(5-15)    $(ac + bd)^2 \le (a^2 + b^2)(c^2 + d^2),$

valid for any four real numbers a, b, c, d, and the Schwarz inequality for
integrals,

(5-16)    $\left(\int_a^b p(t)\,q(t)\,dt\right)^2 \le \int_a^b [p(t)]^2\,dt \int_a^b [q(t)]^2\,dt,$

valid for any two functions p and q that are integrable on the interval
[a, b]. Proofs of these well-known inequalities are sketched in
problems 3 and 4.

† A reader who is willing to accept the statement of theorem 5.3 may omit the
remainder of the section without loss of continuity.

By the chain rule of differentiation (see Taylor [1959], p. 590) we have

    $\dfrac{df_t}{dt} = h f_x + k f_y,$

where

    $f_x = f_x(x_t, y_t), \qquad f_y = f_y(x_t, y_t),$

and hence

    $f_1 - f_0 = \int_0^1 (h f_x + k f_y)\,dt.$

By (5-15),

    $(h f_x + k f_y)^2 \le (h^2 + k^2)(f_x^2 + f_y^2).$

Hence

    $(f_1 - f_0)^2 \le (h^2 + k^2)\left[\int_0^1 \sqrt{f_x^2 + f_y^2}\,dt\right]^2.$

We now use (5-16) where a = 0, b = 1, p(t) = 1, $q(t) = (f_x^2 + f_y^2)^{1/2}$. It
follows that

    $(f_1 - f_0)^2 \le (h^2 + k^2)\int_0^1 (f_x^2 + f_y^2)\,dt.$

In exactly the same manner we find

    $(g_1 - g_0)^2 \le (h^2 + k^2)\int_0^1 (g_x^2 + g_y^2)\,dt.$

Adding the last two inequalities and using $f_x^2 + f_y^2 + g_x^2 + g_y^2 \le J^2$, we
obtain (5-14).


EXAMPLE
2. Let

    $f(x, y) = A \sin x + B \cos y, \qquad g(x, y) = A \cos x - B \sin y,$

where A and B are constants. We find

    $f_x^2 + f_y^2 + g_x^2 + g_y^2 = A^2 + B^2,$

thus (5-8) holds with $L = \sqrt{A^2 + B^2}$. The conditions of theorem 5.2
thus are satisfied, for instance for a = -2, b = 2, whenever $A^2 + B^2 < 1$.
Convergence of the process for A = 0.7, B = 0.2, $x_0 = y_0 = 0$ is
illustrated by the values given in table 5.3.


Table 5.3

  n      x_n         y_n
  1    0.200000    0.700000
  2    0.292037    0.557203
  3    0.371280    0.564599
  4    0.422927    0.545289
 28    0.526519    0.507921
 29    0.526521    0.507921
 30    0.526521    0.507920
 31    0.526522    0.507920


Naturally, the condition of theorem 5.3 is not necessary for convergence 
of the iterative process. For instance for the problem considered in 
example 2 convergence also takes place for A = B = 0.71. 
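
The iteration of example 2 can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the helper name `iterate` and the choice of 60 steps are assumptions, and the Euclidean norm is assumed in the bound (5-9). The first rows printed should agree with table 5.3.

```python
import math

def iterate(f, g, x0, y0, steps):
    """Return the points (x_n, y_n), n = 0..steps, of x = f(x, y), y = g(x, y)."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(steps):
        x, y = f(x, y), g(x, y)   # both components use the old (x, y)
        points.append((x, y))
    return points

A, B = 0.7, 0.2
f = lambda x, y: A * math.sin(x) + B * math.cos(y)
g = lambda x, y: A * math.cos(x) - B * math.sin(y)

L = math.hypot(A, B)              # Lipschitz constant supplied by theorem 5.3
pts = iterate(f, g, 0.0, 0.0, 60)
step1 = math.hypot(pts[1][0] - pts[0][0], pts[1][1] - pts[0][1])

for n in (1, 2, 3, 4, 28, 31):
    x, y = pts[n]
    bound = L**n / (1.0 - L) * step1   # right-hand side of (5-9)
    print(f"{n:2d}  {x:.6f}  {y:.6f}  bound {bound:.2e}")
```

The printed bound is the a priori estimate (5-9); the actual error is typically much smaller, since L is only an upper bound for the contraction factor.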


Problems 


3. Prove (5-15) by observing that the equation for x,

    $(a + cx)^2 + (b + dx)^2 = 0,$

can have at most one real solution.
4. Prove (5-16) by noting that the equation in x,

    $\int_a^b (p(t) + x\,q(t))^2\,dt = 0,$

can have at most one real solution. What are the conditions on p and q
in order that there is a real solution?


5.4 Quadratic Convergence 

Suppose the functions f and g satisfy the conditions of theorem 5.2 and
have continuous derivatives up to order 2 in R. The sequence of points
$(x_n, y_n)$ defined by algorithm 5.2 then converges to a solution (s, t) of the
system (5-1). What is the behavior of the errors

    $d_n = x_n - s,$
    $e_n = y_n - t,$

as $n \to \infty$?
An application of Taylor’s theorem for functions of two variables (see
Buck [1956], p. 200) shows that

    $\begin{aligned} d_{n+1} &= x_{n+1} - s \\ &= f(x_n, y_n) - f(s, t) \\ &= f(s + d_n, t + e_n) - f(s, t) \\ &= f_x(s, t)\,d_n + f_y(s, t)\,e_n + O(\|\mathbf{d}_n\|^2), \end{aligned}$

and similarly

    $e_{n+1} = g_x(s, t)\,d_n + g_y(s, t)\,e_n + O(\|\mathbf{d}_n\|^2).$


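Here $\mathbf{d}_n$ denotes the vector of errors,

    $\mathbf{d}_n = \begin{pmatrix} d_n \\ e_n \end{pmatrix},$

and $O(\|\mathbf{d}_n\|^2)$ denotes a quantity bounded by $C\|\mathbf{d}_n\|^2$. Introducing the
Jacobian matrix of the functions f and g,

    $\mathbf{J} = \begin{pmatrix} f_x & f_y \\ g_x & g_y \end{pmatrix},$

the above relations can be written in abbreviated form as follows:

(5-17)    $\mathbf{d}_{n+1} = \mathbf{J}(s, t)\,\mathbf{d}_n + O(\|\mathbf{d}_n\|^2).$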


Relation (5-17) is the multidimensional generalization of (4-8). If
$\mathbf{J}(s, t) \ne 0$ (that is, if the elements of the matrix J(s, t) are not all zero), it
shows that at each step of the iteration the error vector is approximately
multiplied by a constant matrix. In this sense we again may speak of
linear convergence. If J(s, t) = 0 (that is, if all four elements of J are
zero), then we see that the norm of the error at the (n + 1)st step is of the
order of the square of the norm of the error at the nth step. This is
similar to what earlier has been called quadratic convergence.

The following analog of theorem 4.6 holds in the case where J(s, t) = 0: 


Theorem 5.4 Let the functions f and g be defined in a region R, and
let them satisfy the following conditions:

(i) The first partial derivatives of f and g exist and are continuous
in R.
(ii) The system (5-1) has a solution (s, t) in the interior of R such that
J(s, t) = 0.

Then there exists a number d > 0 such that algorithm 5.2 converges
to (s, t) for any choice of the starting point within the distance d of
the solution.


The conclusion can be expressed by saying that the algorithm always
converges if the starting vector is “sufficiently close” to the solution
vector. The proof of theorem 5.4 is based on the fact that by virtue of
the continuity of the first partial derivatives

    $\sqrt{f_x^2 + f_y^2 + g_x^2 + g_y^2} \le L < 1$

in a certain neighborhood of $(s, t) \in R$. It then follows from theorem 5.3
that the conditions of theorem 5.2 are satisfied in that neighborhood.
Further details are omitted.


5.5 Newton’s Method for Systems of Equations 


We now shall consider the problem of solving systems of two equations
with two unknowns which are of the form

(5-18)    $F(x, y) = 0, \qquad G(x, y) = 0,$

where both functions F and G are defined and twice continuously differentiable
on a certain rectangle R of the (x, y) plane. We suppose that the
system (5-18) has a solution (s, t) in the interior of R, and that the Jacobian
determinant

(5-19)    $D(x, y) = \begin{vmatrix} F_x(x, y) & F_y(x, y) \\ G_x(x, y) & G_y(x, y) \end{vmatrix}$

is different from zero when (x, y) = (s, t). It then follows from a theorem
of calculus (see Buck [1956], p. 216) that the system (5-18) has no solution
other than (s, t) in a certain neighborhood of the point (s, t).

Newton’s method for solving a single equation F(x) = 0 could be
understood as arising from replacing $F(x + \delta)$ by $F(x) + F'(x)\delta$ and
solving the equation

    $F(x) + F'(x)\,\delta = 0$

for δ, thus obtaining a supposedly better approximation x + δ to s than x.
We now apply the same principle to the system (5-18). Assuming that
(x, y) is a point “near” the desired solution (s, t), we replace the function
$F(x + \delta, y + \varepsilon)$ by its first degree Taylor polynomial at the point (x, y),
that is, by $F(x, y) + F_x(x, y)\,\delta + F_y(x, y)\,\varepsilon$. A similar replacement is
made for the function $G(x + \delta, y + \varepsilon)$. Setting the Taylor polynomials
equal to zero, we obtain a system of two linear equations for δ and ε,

(5-20)    $F_x(x, y)\,\delta + F_y(x, y)\,\varepsilon = -F(x, y),$
          $G_x(x, y)\,\delta + G_y(x, y)\,\varepsilon = -G(x, y).$


The determinant of this system is just the quantity D(x, y) defined by
(5-19). Since D is continuous and $D(s, t) \ne 0$ by hypothesis, it follows
that $D(x, y) \ne 0$ for all points (x, y) sufficiently close to (s, t). Thus for
all these (x, y) the system (5-20) has a unique solution $\delta = \delta(x, y)$,
$\varepsilon = \varepsilon(x, y)$. The point $(x + \delta(x, y),\ y + \varepsilon(x, y))$ is now chosen as the next
approximation to (s, t). Algorithmically speaking, the procedure can be
described as follows:


Algorithm 5.5 (Newton’s method for two variables.) Choose
$(x_0, y_0)$, and determine the sequence of points $(x_n, y_n)$ by

(5-21)    $x_{n+1} = f(x_n, y_n), \qquad y_{n+1} = g(x_n, y_n), \qquad n = 0, 1, 2, \ldots,$

where the functions f and g are defined by

    $f(x, y) = x + \delta(x, y), \qquad g(x, y) = y + \varepsilon(x, y),$

and where δ and ε denote the solution of the linear system (5-20).


We assert that the sequence $(x_n, y_n)$ converges quadratically to (s, t)
for all $(x_0, y_0)$ sufficiently close to (s, t). In order to verify this statement,
it suffices by theorem 5.4 to show that all elements of the Jacobian matrix
of the iteration functions f and g,

    $\mathbf{J} = \begin{pmatrix} f_x(x, y) & f_y(x, y) \\ g_x(x, y) & g_y(x, y) \end{pmatrix},$

are zero for (x, y) = (s, t). Omitting arguments, we have

(5-22)    $f_x = 1 + \delta_x, \quad f_y = \delta_y, \quad g_x = \varepsilon_x, \quad g_y = 1 + \varepsilon_y.$


The values of the derivatives $\delta_x, \delta_y, \varepsilon_x, \varepsilon_y$ are best determined from (5-20)
by implicit differentiation. Differentiating the first of these equations, we
get

    $F_{xx}\,\delta + F_x\,\delta_x + F_{xy}\,\varepsilon + F_y\,\varepsilon_x = -F_x,$
    $F_{xy}\,\delta + F_x\,\delta_y + F_{yy}\,\varepsilon + F_y\,\varepsilon_y = -F_y.$

Two similar relations involving the function G are obtained by differentiating
the second relation. We now set (x, y) = (s, t) and observe that
$\delta(s, t) = \varepsilon(s, t) = 0$. We thus obtain for (x, y) = (s, t)

    $F_x\,\delta_x + F_y\,\varepsilon_x = -F_x,$
    $G_x\,\delta_x + G_y\,\varepsilon_x = -G_x,$
    $F_x\,\delta_y + F_y\,\varepsilon_y = -F_y,$
    $G_x\,\delta_y + G_y\,\varepsilon_y = -G_y.$

The determinant of each of these two systems of linear equations is again
the Jacobian determinant D(s, t) and hence is different from zero. We
now easily find

    $\delta_x = -1, \qquad \delta_y = 0,$
    $\varepsilon_x = 0, \qquad \varepsilon_y = -1.$

According to equation (5-22), this implies that all elements of the matrix J
are zero at the point (s, t), as desired.

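For actual execution of algorithm 5.5 (although not for the above
theoretical analysis) it is necessary at each step of the iteration to solve
the system (5-20) for δ and ε. An application of Cramer’s rule† yields

    $\delta = \dfrac{\begin{vmatrix} -F & F_y \\ -G & G_y \end{vmatrix}}{\begin{vmatrix} F_x & F_y \\ G_x & G_y \end{vmatrix}} = \dfrac{G F_y - F G_y}{F_x G_y - F_y G_x},$

    $\varepsilon = \dfrac{\begin{vmatrix} F_x & -F \\ G_x & -G \end{vmatrix}}{\begin{vmatrix} F_x & F_y \\ G_x & G_y \end{vmatrix}} = \dfrac{F G_x - G F_x}{F_x G_y - F_y G_x}.$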








EXAMPLE
3. We solve the system considered in example 2 by writing it in the form

    $x - A \sin x - B \cos y = 0,$
    $y - A \cos x + B \sin y = 0,$

and applying Newton’s method. The following values are obtained for
A = 0.7, B = 0.2, $x_0 = y_0 = 0$:

  n      x_n          y_n
  1    0.6666667    0.5833333
  2    0.5362400    0.5088490
  3    0.5265620    0.5079319
  4    0.5265226    0.5079197
  5    0.5265226    0.5079197

The much greater rapidity of convergence is clearly evident.
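
Algorithm 5.5 with the Cramer's-rule formulas for δ and ε can be sketched as follows. The function name `newton2` and the convention that `jac` returns the tuple (F_x, F_y, G_x, G_y) are assumptions made for this illustration; the partial derivatives of example 3 are entered by hand.

```python
import math

def newton2(F, G, jac, x, y, steps):
    """Newton's method for F(x, y) = G(x, y) = 0; (5-20) is solved by Cramer's rule."""
    history = [(x, y)]
    for _ in range(steps):
        Fx, Fy, Gx, Gy = jac(x, y)
        D = Fx * Gy - Fy * Gx                 # Jacobian determinant (5-19)
        if D == 0.0:
            raise ZeroDivisionError("Jacobian determinant vanished")
        Fv, Gv = F(x, y), G(x, y)
        delta = (Gv * Fy - Fv * Gy) / D       # increment of x
        eps = (Fv * Gx - Gv * Fx) / D         # increment of y
        x, y = x + delta, y + eps
        history.append((x, y))
    return history

A, B = 0.7, 0.2
F = lambda x, y: x - A * math.sin(x) - B * math.cos(y)
G = lambda x, y: y - A * math.cos(x) + B * math.sin(y)
jac = lambda x, y: (1 - A * math.cos(x), B * math.sin(y),
                    A * math.sin(x), 1 + B * math.cos(y))

hist = newton2(F, G, jac, 0.0, 0.0, 5)
for n, (x, y) in enumerate(hist):
    print(n, f"{x:.7f}", f"{y:.7f}")
```

Rows 1 to 5 of the output should reproduce the table of example 3.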


† It is well known that Cramer’s rule should never be used for the solution of linear
systems of any sizable order. The solution of such systems is much more
conveniently found by a process of elimination. For systems of small order (such as
2 or 3) Cramer’s rule is perfectly applicable, however.

Problems


5. Generalizing the approach outlined in §4.7, one might try to generalize
Newton’s method to systems by suitably choosing functions h and k so
that iteration of the functions

    $f(x, y) = x + h(x, y)\,F(x, y),$
    $g(x, y) = y + k(x, y)\,G(x, y)$

yields a quadratically convergent process. Show that this is not possible
in general.
6*. Determine the matrix H = H(x) such that application of iteration to

    $\mathbf{f}(\mathbf{x}) = \mathbf{x} + \mathbf{H}(\mathbf{x})\,\mathbf{F}(\mathbf{x})$

yields a quadratically convergent process. Show that the process
obtained is identical with Newton’s method.
7. Find a solution near x = 0.8, y = 0.4 of the system

    $x - x^2 - y^2 = 0,$
    $y - x^2 + y^2 = 0,$

using Newton’s method.
8. Solve by Newton’s method the pair of equations

    $x + 3 \log_{10} x - y^2 = 0,$
    $2x^2 - xy - 5x + 1 = 0,$

starting with $x_0 = 3.4$, $y_0 = 2.2$.
9. Solve the system

    $4x^3 - 27xy^2 + 25 = 0,$
    $4x^2 - 3y^3 - 1 = 0,$

by Newton’s method, starting with $x_0 = y_0 = 1$.


5.6 The Determination of Quadratic Factors 


In this section we shall prepare the ground for the application of 
Newton’s method to the problem of finding pairs of complex conjugate 
zeros of polynomials with real coefficients. We first consider the following 
preliminary problem. 

Given a polynomial p of degree $N \ge 2$ with real coefficients,

    $p(x) = a_0 x^N + a_1 x^{N-1} + \cdots + a_N,$

and a quadratic polynomial $x^2 - ux - v$, to determine constants
$b_0, b_1, \ldots, b_N$ such that the identity

(5-23)    $p(x) = (x^2 - ux - v)\,q(x) + b_{N-1}(x - u) + b_N$

holds, where

    $q(x) = b_0 x^{N-2} + b_1 x^{N-3} + \cdots + b_{N-2}.$
Multiplying the polynomials on the right side of (5-23) and comparing
coefficients of like powers of x, we find the following conditions on
$b_0, b_1, \ldots, b_N$:

    $b_0 = a_0,$
    $b_1 = a_1 + u b_0,$
    $b_2 = a_2 + u b_1 + v b_0,$

and generally for n = 2, ..., N - 2

(5-24)    $b_n = a_n + u b_{n-1} + v b_{n-2}.$

By comparing the coefficients of $x^1$ and $x^0$ it is seen that (5-24) also holds
for n = N - 1 and n = N. Furthermore, if we agree to put $b_{-1} = b_{-2} = 0$,
the relation holds also for n = 0 and n = 1. Thus we find the
following algorithm for determining the coefficients $b_0, b_1, \ldots, b_N$ in the
representation (5-23) of the polynomial p:


Algorithm 5.6 If $a_0, a_1, \ldots, a_N$, u, v are given numbers, determine
$b_0, b_1, \ldots, b_N$ from the recurrence relation $b_{-2} = b_{-1} = 0$,

    $b_n = a_n + u b_{n-1} + v b_{n-2}, \qquad n = 0, 1, \ldots, N.$
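
The recurrence of algorithm 5.6 translates almost literally into code. The sketch below is illustrative only; the function name `quadratic_division` is an assumption, and the coefficients are passed as a list a_0, ..., a_N.

```python
def quadratic_division(a, u, v):
    """Coefficients b_0..b_N of algorithm 5.6 for the polynomial with coefficients a_0..a_N."""
    b_prev2 = b_prev1 = 0.0            # b_{-2} = b_{-1} = 0
    b = []
    for a_n in a:
        b_n = a_n + u * b_prev1 + v * b_prev2
        b.append(b_n)
        b_prev2, b_prev1 = b_prev1, b_n
    return b

# p(x) = x^3 - x^2 + 2x + 5 and the trial quadratic x^2 - 2x + 5 (u = 2, v = -5)
b = quadratic_division([1.0, -1.0, 2.0, 5.0], 2.0, -5.0)
print(b)

# p(x) = (x^2 + 1)(x - 2) = x^3 - 2x^2 + x - 2 with the exact factor x^2 + 1 (u = 0, v = -1)
exact = quadratic_division([1.0, -2.0, 1.0, -2.0], 0.0, -1.0)
print(exact)
```

In the second call the last two coefficients vanish, because $x^2 + 1$ divides the polynomial exactly.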


Clearly, the coefficients $b_0, b_1, \ldots, b_N$ thus obtained are functions of the
variables u and v. The reason for considering the $b_n$ is contained in the
following theorem:

Theorem 5.6 The polynomial $x^2 - ux - v$ is a quadratic factor of
the real polynomial $p(x) = a_0 x^N + a_1 x^{N-1} + \cdots + a_N$ if and only if
$b_{N-1} = b_N = 0$.


Proof. (a) If $b_{N-1} = b_N = 0$, then (5-23) reduces to the relation

    $p(x) = (x^2 - ux - v)\,q(x).$

It shows that a zero of $x^2 - ux - v$ is also a zero of p. By considering
the derivative we find that a double zero of $x^2 - ux - v$ is also a double
zero of p. Thus $x^2 - ux - v$ is a quadratic factor of p.

(b) We now assume that $x^2 - ux - v$ is a quadratic factor. Denoting
its zeros by $z_1, z_2$, we first assume that $z_1 \ne z_2$. We then have from (5-23),
since $z_k^2 - u z_k - v = 0$ (k = 1, 2),

    $0 = p(z_k) = b_{N-1}(z_k - u) + b_N, \qquad k = 1, 2,$

or, written more explicitly,

    $(z_1 - u)\,b_{N-1} + b_N = 0,$
    $(z_2 - u)\,b_{N-1} + b_N = 0.$

This homogeneous system of two linear equations has the determinant
$z_1 - z_2 \ne 0$, hence its only solution is $b_{N-1} = b_N = 0$. If the two zeros
of the quadratic factor $x^2 - ux - v$ coincide, then $p(z_1) = p'(z_1) = 0$,
hence from (5-23)

    $b_{N-1}(z_1 - u) + b_N = 0,$
    $b_{N-1} = 0,$

and it follows again that $b_{N-1} = b_N = 0$.

Theorem 5.6 shows that the problem of determining a quadratic factor
of the polynomial p is equivalent to the problem of determining u and v
such that

(5-25)    $b_{N-1}(u, v) = 0, \qquad b_N(u, v) = 0.$

The solution of this pair of simultaneous equations by Newton’s method
is known as Bairstow’s method.


5.7 Bairstow’s Method 


In order to apply Newton’s method to the system (5-25) we require the
partial derivatives

    $\dfrac{\partial b_{N-1}}{\partial u}(u, v), \quad \dfrac{\partial b_{N-1}}{\partial v}(u, v), \quad \dfrac{\partial b_N}{\partial u}(u, v), \quad \dfrac{\partial b_N}{\partial v}(u, v).$

We shall obtain recurrence relations for these derivatives by differentiating
the recurrence relations (5-24).

We first differentiate with respect to u and observe that $\partial b_0/\partial u = 0$.
Writing

(5-26)    $c_n = \dfrac{\partial b_{n+1}}{\partial u}$

for notational convenience, we obtain

    $c_0 = b_0,$
    $c_1 = b_1 + u c_0,$
    $c_2 = b_2 + u c_1 + v c_0,$

and generally

    $c_n = b_n + u c_{n-1} + v c_{n-2},$

where n = 0, 1, 2, ..., N - 1, $c_{-2} = c_{-1} = 0$. We see that the $c_n$ are
generated from the $b_n$ exactly as the $b_n$ were generated from the $a_n$. From
(5-26) we have

    $\dfrac{\partial b_{N-1}}{\partial u} = c_{N-2}, \qquad \dfrac{\partial b_N}{\partial u} = c_{N-1}.$


We next differentiate with respect to v. We observe that now
$\partial b_0/\partial v = \partial b_1/\partial v = 0$. Thus, writing

(5-27)    $d_n = \dfrac{\partial b_{n+2}}{\partial v},$

we obtain $d_{-1} = d_{-2} = 0$,

    $d_0 = b_0,$
    $d_1 = b_1 + u d_0,$
    $d_2 = b_2 + u d_1 + v d_0,$

and generally

    $d_n = b_n + u d_{n-1} + v d_{n-2},$

n = 0, 1, 2, ..., N - 2. These recurrence relations and initial conditions
are exactly the same as those for the $c_n$, hence we have $d_n = c_n$,
n = 0, 1, 2, ..., N - 2, and from (5-27) we obtain

    $\dfrac{\partial b_{N-1}}{\partial v} = c_{N-3}, \qquad \dfrac{\partial b_N}{\partial v} = c_{N-2}.$





If the increments of u and v are denoted by δ and ε, respectively, their
values as determined by Newton’s method must satisfy

    $c_{N-2}\,\delta + c_{N-3}\,\varepsilon = -b_{N-1},$
    $c_{N-1}\,\delta + c_{N-2}\,\varepsilon = -b_N$

and hence are given by

(5-28)    $\delta = \dfrac{b_N c_{N-3} - b_{N-1} c_{N-2}}{c_{N-2}^2 - c_{N-1} c_{N-3}}, \qquad \varepsilon = \dfrac{b_{N-1} c_{N-1} - b_N c_{N-2}}{c_{N-2}^2 - c_{N-1} c_{N-3}}.$


The whole procedure is summarized in

Algorithm 5.7 Given the polynomial

    $p(x) = a_0 x^N + a_1 x^{N-1} + a_2 x^{N-2} + \cdots + a_N$

and an arbitrary tentative quadratic factor $x^2 - u_0 x - v_0$, determine
a sequence $\{x^2 - u_k x - v_k\}$ of quadratic factors as follows: For each
k = 0, 1, 2, ... determine the sequence $\{b_n\} = \{b_n^{(k)}\}$ from $b_{-2} = b_{-1} = 0$,

    $b_n = a_n + u_k b_{n-1} + v_k b_{n-2}, \qquad n = 0, 1, \ldots, N,$

and the sequence $\{c_n\} = \{c_n^{(k)}\}$ from $c_{-2} = c_{-1} = 0$,

    $c_n = b_n + u_k c_{n-1} + v_k c_{n-2}, \qquad n = 0, 1, \ldots, N - 1.$

Then set $u_{k+1} = u_k + \delta$, $v_{k+1} = v_k + \varepsilon$, where δ and ε are given by
(5-28).
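
A compact sketch of algorithm 5.7 follows. The names `recur`, `bairstow_step` and `bairstow`, the stopping tolerance, and the iteration cap are assumptions made for this illustration. No safeguard is included for a vanishing denominator in (5-28), and $c_{-1} = 0$ is substituted for $c_{N-3}$ when N = 2.

```python
def recur(coeffs, u, v):
    """Apply r_n = coeffs[n] + u*r_{n-1} + v*r_{n-2} with r_{-1} = r_{-2} = 0."""
    out, r1, r2 = [], 0.0, 0.0
    for c in coeffs:
        r = c + u * r1 + v * r2
        out.append(r)
        r1, r2 = r, r1
    return out

def bairstow_step(a, u, v):
    """One step of algorithm 5.7; returns the increments (delta, eps) of (5-28)."""
    N = len(a) - 1
    b = recur(a, u, v)                    # b_0 .. b_N
    c = recur(b[:-1], u, v)               # c_0 .. c_{N-1}
    c_N3 = c[N - 3] if N >= 3 else 0.0    # c_{-1} = 0 when N = 2
    denom = c[N - 2] ** 2 - c[N - 1] * c_N3
    delta = (b[N] * c_N3 - b[N - 1] * c[N - 2]) / denom
    eps = (b[N - 1] * c[N - 1] - b[N] * c[N - 2]) / denom
    return delta, eps

def bairstow(a, u, v, tol=1e-12, max_iter=50):
    """Iterate until the increments are negligible; returns (u, v)."""
    for _ in range(max_iter):
        delta, eps = bairstow_step(a, u, v)
        u, v = u + delta, v + eps
        if abs(delta) + abs(eps) < tol:
            break
    return u, v

a = [1.0, -1.0, 2.0, 5.0]                 # p(x) = x^3 - x^2 + 2x + 5
print(bairstow_step(a, 2.0, -5.0))        # increments of the first step
print(bairstow(a, 2.0, -5.0))             # converged (u, v)
```

For this cubic the first step gives δ = 1/9 and ε = 2/3, which can be checked by hand from the recurrences.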
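Schematically the algorithm can be described as follows:

    $\begin{array}{ccccc}
    a_0 & & b_0 & & c_0 \\
    a_1 & & b_1 & & c_1 \\
    a_2 & \xrightarrow{\;u,\,v\;} & b_2 & \xrightarrow{\;u,\,v\;} & c_2 \\
    \vdots & & \vdots & & \vdots \\
    a_{N-3} & & b_{N-3} & & \mathbf{c}_{N-3} \\
    a_{N-2} & & b_{N-2} & & \mathbf{c}_{N-2} \\
    a_{N-1} & & \mathbf{b}_{N-1} & & \mathbf{c}_{N-1} \\
    a_N & & \mathbf{b}_N & &
    \end{array}$

Scheme 5.8

The coefficients that are needed in formula (5-28) are printed in bold face.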


EXAMPLE
4. To determine a quadratic factor of

    $p(x) = x^3 - x^2 + 2x + 5$

near $x^2 - 2x + 5$. Algorithm 5.7 yields for $u_0 = 2$, $v_0 = -5$ (watch the
signs!)

  n    a_n    b_n    c_n
  0     1      1      1       3δ + ε = 1
  1    -1      1      3       0δ + 3ε = 2
  2     2     -1      0       δ = 0.11111111
  3     5     -2              ε = 0.66666667


Continuing with $u_1 = u_0 + \delta = 2.111111$, $v_1 = v_0 + \varepsilon = -4.33333333$,
we find

  n    a_n    b_n           c_n
  0     1     1             1             3.22222222δ + ε = -0.01234568
  1    -1     1.11111111    3.22222222    2.48148148δ + 3.22222222ε = -0.2112483
  2     2     0.01234568    2.48148148    δ = 0.02170139
  3     5     0.21124803                  ε = -0.08227238

yielding $u_2 = 2.1328125$, $v_2 = -4.41560571$. Proceeding in a similar
manner, we find

    $u_3 = 2.13249371, \qquad v_3 = -4.41503560,$
    $u_4 = 2.13249369, \qquad v_4 = -4.41503564.$
Calculating the $b_n$ with the last values, we obtain

  n    a_n    b_n
  0     1     1.00000000
  1    -1     1.13249369
  2     2     0.00000000
  3     5     0.00000000

indicating that convergence has been achieved. The two complex zeros
are thus found to be

    $z_{1,2} = \dfrac{u}{2} \pm i\sqrt{-\dfrac{u^2}{4} - v} = 1.06624685 \pm i\,1.81056712.$


Problems
10. Determine all zeros of the polynomial

    $p(x) = x^4 - 8x^3 + 39x^2 - 62x + 51$

using the fact that

    $p(x) = (x^2 - 2x + 2)(x^2 - 6x + 25) + 1.$

11. Determine quadratic factors of the polynomial

    $p(x) = 3x^6 + 9x^5 + 9x^4 + 5x^3 + 3x^2 + 8x + 5$

near $x^2 + 1.541451x + 1.487398$ and $x^2 - 1.127178x + 0.831492$. Also
determine the real zeros near -1.86 and -0.72.

5.8 Convergence of Bairstow’s Method 


In order to justify the application of Newton’s method to the problem
of solving (5-25) we have to show, according to §5.5, that the Jacobian
determinant

(5-29)    $D(u, v) = \begin{vmatrix} \dfrac{\partial b_{N-1}}{\partial u}(u, v) & \dfrac{\partial b_{N-1}}{\partial v}(u, v) \\ \dfrac{\partial b_N}{\partial u}(u, v) & \dfrac{\partial b_N}{\partial v}(u, v) \end{vmatrix}$

is different from zero for (u, v) = (s, t), where $x^2 - sx - t$ is a quadratic
factor of the polynomial p. One condition under which this is the case is
as follows.

Theorem 5.8 Let $x^2 - sx - t$ be a quadratic factor of the polynomial
p, and let its zeros $z_1, z_2$ be two distinct, simple zeros of p. Then
$D(s, t) \ne 0$.


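Proof. We differentiate the identity (5-23) with respect to both u and v.
Since p does not depend on u or v, the result is

    $0 = -x\,q(x) + (x^2 - ux - v)\,\dfrac{\partial q(x)}{\partial u} + \dfrac{\partial b_{N-1}}{\partial u}(x - u) - b_{N-1} + \dfrac{\partial b_N}{\partial u},$

    $0 = -q(x) + (x^2 - ux - v)\,\dfrac{\partial q(x)}{\partial v} + \dfrac{\partial b_{N-1}}{\partial v}(x - u) + \dfrac{\partial b_N}{\partial v}.$

Setting here u = s, v = t, $x = z_k$ (k = 1, 2) we obtain by virtue of
$z_k^2 - s z_k - t = 0$ (k = 1, 2), and of $b_{N-1}(s, t) = 0$ (theorem 5.6), the four relations

(5-30)    $\dfrac{\partial b_{N-1}}{\partial u}(z_k - s) + \dfrac{\partial b_N}{\partial u} = z_k q_k \qquad (k = 1, 2),$

(5-31)    $\dfrac{\partial b_{N-1}}{\partial v}(z_k - s) + \dfrac{\partial b_N}{\partial v} = q_k \qquad (k = 1, 2),$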


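where $q_k = q(z_k)$, and where the derivatives are taken at (u, v) = (s, t).
The two relations (5-30) can be regarded as a system of two linear equations
for the two unknowns $\partial b_{N-1}/\partial u$ and $\partial b_N/\partial u$ with the nonvanishing
determinant $z_1 - z_2$. We thus have

    $\dfrac{\partial b_{N-1}}{\partial u} = \dfrac{z_1 q_1 - z_2 q_2}{z_1 - z_2}, \qquad \dfrac{\partial b_N}{\partial u} = \dfrac{z_1 z_2 (q_2 - q_1) + s(z_1 q_1 - z_2 q_2)}{z_1 - z_2}.$

Solving the system (5-31) in a similar manner, we find

    $\dfrac{\partial b_{N-1}}{\partial v} = \dfrac{q_1 - q_2}{z_1 - z_2}, \qquad \dfrac{\partial b_N}{\partial v} = \dfrac{z_1 q_2 - z_2 q_1 + s(q_1 - q_2)}{z_1 - z_2}.$

Some algebraic manipulation now yields

    $D(s, t) = \begin{vmatrix} \dfrac{\partial b_{N-1}}{\partial u} & \dfrac{\partial b_{N-1}}{\partial v} \\ \dfrac{\partial b_N}{\partial u} & \dfrac{\partial b_N}{\partial v} \end{vmatrix} = q_1 q_2 = q(z_1)\,q(z_2).$

Since $z_1$ and $z_2$ are both simple zeros of p, we have $q(z_k) \ne 0$, k = 1, 2,
and the conclusion of the theorem follows.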
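
The identity $D(s, t) = q(z_1)\,q(z_2)$ can be checked numerically for the cubic of example 4. The sketch below is illustrative only: it reuses the recurrences of algorithm 5.7 in a helper named `recur` (an assumed name), runs a fixed 20 Bairstow steps, and uses the fact that for N = 3 the determinant (5-29) equals $c_1^2 - c_2 c_0$ and $q(x) = b_0 x + b_1$.

```python
import cmath

def recur(coeffs, u, v):
    """Apply r_n = coeffs[n] + u*r_{n-1} + v*r_{n-2} with r_{-1} = r_{-2} = 0."""
    out, r1, r2 = [], 0.0, 0.0
    for c in coeffs:
        r = c + u * r1 + v * r2
        out.append(r)
        r1, r2 = r, r1
    return out

a = [1.0, -1.0, 2.0, 5.0]               # p(x) = x^3 - x^2 + 2x + 5, N = 3
u, v = 2.0, -5.0
for _ in range(20):                      # Bairstow iteration (algorithm 5.7)
    b = recur(a, u, v)
    c = recur(b[:-1], u, v)
    det = c[1] ** 2 - c[2] * c[0]
    u += (b[3] * c[0] - b[2] * c[1]) / det
    v += (b[2] * c[2] - b[3] * c[1]) / det

b = recur(a, u, v)
c = recur(b[:-1], u, v)
D = c[1] ** 2 - c[2] * c[0]              # Jacobian determinant (5-29) at (s, t)

root = cmath.sqrt(u * u / 4 + v)
z1, z2 = u / 2 + root, u / 2 - root      # zeros of x^2 - u x - v
q = lambda z: b[0] * z + b[1]            # q(x) = b_0 x + b_1 for N = 3
product = q(z1) * q(z2)
print(D, product)
```

Both printed quantities should agree to rounding error, and neither is zero, in line with theorem 5.8.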

The theory given in §5.5 now shows that under the hypotheses of
theorem 5.8, algorithm 5.7 actually defines a sequence of quadratic polynomials
$\{x^2 - u_k x - v_k\}$ which converges to $x^2 - sx - t$ whenever the
initial quadratic $x^2 - u_0 x - v_0$ is sufficiently close to $x^2 - sx - t$.

For most polynomials even crude approximations to quadratic factors
are hard to obtain by mere inspection. In chapter 8 we shall discuss a 
method that will automatically produce good first approximations to 
quadratic factors (with complex zeros) of almost any real polynomial. 


5.9 Steffensen’s Iteration for Systems†


We now return to ordinary iteration as applied to systems of equations
(algorithm 5.2). Our goal is to extend the Aitken-Steffensen formula
(4-11) to systems of equations. As explained in §4.4, the rationale behind
that formula consisted in neglecting the small term $\varepsilon_n$ in (4-8). Proceeding
in the same vein, we now assume heuristically that the asymptotic error
formula (5-17) is true without the term denoted by $O(\|\mathbf{d}_n\|^2)$.

Denoting by $\{\mathbf{x}_n\}$ a sequence of vectors generated by algorithm 5.2 and
by s a solution of x = f(x), our assumption implies that

(5-32)    $\mathbf{x}_{n+1} - \mathbf{s} = \mathbf{J}(\mathbf{x}_n - \mathbf{s}),$

for n = 0, 1, 2, ..., where J = J(s) denotes the Jacobian matrix of the
function f taken at the solution s. The problem is to determine s from
several consecutive iterates $\mathbf{x}_n, \mathbf{x}_{n+1}, \mathbf{x}_{n+2}, \ldots$, notwithstanding the fact
that J is unknown.

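Subtracting two consecutive equations (5-32) from each other, we find,
using the symbol Δ to denote forward differences,

(5-33)    $\Delta\mathbf{x}_{n+1} = \mathbf{J}\,\Delta\mathbf{x}_n, \qquad n = 0, 1, 2, \ldots.$

We define $\mathbf{X}_n$ to be the matrix‡ with the columns $\mathbf{x}_n$ and $\mathbf{x}_{n+1}$,

    $\mathbf{X}_n = (\mathbf{x}_n, \mathbf{x}_{n+1}) = \begin{pmatrix} x_n & x_{n+1} \\ y_n & y_{n+1} \end{pmatrix}.$

Defining $\Delta\mathbf{X}_n$ in the obvious way, (5-33) shows that

    $\mathbf{J}\,\Delta\mathbf{X}_n = \Delta\mathbf{X}_{n+1}, \qquad n = 0, 1, 2, \ldots.$

If the matrix $\Delta\mathbf{X}_n$ is nonsingular, we can solve for J, finding

(5-34)    $\mathbf{J} = \Delta\mathbf{X}_{n+1}(\Delta\mathbf{X}_n)^{-1}.$

We now solve (5-32) for s. Assuming that I - J is nonsingular
(I = unit matrix), we get

    $(\mathbf{I} - \mathbf{J})\,\mathbf{s} = \mathbf{x}_{n+1} - \mathbf{J}\mathbf{x}_n = (\mathbf{I} - \mathbf{J})\,\mathbf{x}_n + \Delta\mathbf{x}_n$

and hence

    $\mathbf{s} = \mathbf{x}_n + (\mathbf{I} - \mathbf{J})^{-1}\,\Delta\mathbf{x}_n.$

† This section may be omitted without loss of continuity.
‡ Here we use implicitly the fact that the vectors considered have two components.
In the case of N components we would have to put

    $\mathbf{X}_n = (\mathbf{x}_n, \mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+N-1}).$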


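Using equation (5-34) and applying the matrix identity $(AB)^{-1} = B^{-1}A^{-1}$
we have, always proceeding in a purely formal manner,

    $\begin{aligned} (\mathbf{I} - \mathbf{J})^{-1} &= (\mathbf{I} - \Delta\mathbf{X}_{n+1}(\Delta\mathbf{X}_n)^{-1})^{-1} \\ &= ((\Delta\mathbf{X}_n - \Delta\mathbf{X}_{n+1})(\Delta\mathbf{X}_n)^{-1})^{-1} \\ &= -\Delta\mathbf{X}_n(\Delta^2\mathbf{X}_n)^{-1} \end{aligned}$

and hence finally

    $\mathbf{s} = \mathbf{x}_n - \Delta\mathbf{X}_n(\Delta^2\mathbf{X}_n)^{-1}\,\Delta\mathbf{x}_n.$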


This formula for the exact solution has been derived under the assumption
that (5-17) is true without the $O(\|\mathbf{d}_n\|^2)$ term. If this term is present,
we still may hope that the vector

(5-35)    $\mathbf{x}'_n = \mathbf{x}_n - \Delta\mathbf{X}_n(\Delta^2\mathbf{X}_n)^{-1}\,\Delta\mathbf{x}_n$

is closer to s than $\mathbf{x}_n$, provided that the matrix $\Delta^2\mathbf{X}_n$ is nonsingular. It
will be noted that the formula (5-35) is built like (4-11), with the difference
that certain scalars are now replaced by appropriately defined vectors and
matrices.

Formula (5-35) can be used in either of two ways. We can apply
it to a sequence of vectors $\{\mathbf{x}_n\}$ already constructed to obtain a sequence
$\{\mathbf{x}'_n\}$ which presumably converges to s faster. More effectively, the
formula can be used as in algorithm 4.12, as follows.


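Algorithm 5.9 Choose a vector $\mathbf{x}^{(0)}$, and construct the sequence of
vectors $\{\mathbf{x}^{(k)}\}$ as follows: For each k = 0, 1, 2, ..., set $\mathbf{x}_0 = \mathbf{x}^{(k)}$,
calculate $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ from

    $\mathbf{x}_{n+1} = \mathbf{f}(\mathbf{x}_n), \qquad n = 0, 1, 2,$

and let $\mathbf{x}^{(k+1)} = \mathbf{x}'_0$, where $\mathbf{x}'_0$ is defined by (5-35).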


This algorithm has not yet been fully investigated from the theoretical 
point of view. As it stands, it is not even fully defined, since there is no 
indication of what is to be done if the matrix 4?X, is singular. It is 
therefore impossible to prove that the algorithm converges, let alone that 
it converges quadratically. Substantial experimental evidence, and also 
some theoretical considerations, seem to indicate, however, that the 
algorithm is indeed quadratically convergent in a large number of cases, 
even when ordinary iteration diverges. 
EXAMPLE

5. To find the solution of the system

    $x = x^2 + y^2,$
    $y = x^2 - y^2,$

near (0.8, 0.4). Algorithm 5.9 yields the following values:

Table 5.9

  k    x^(k)        y^(k)        x_n          y_n
  0    0.8          0.4
                                 0.8000000    0.4800000
                                 0.8704000    0.4096000
                                 0.9253683    0.5898240
  1    0.7741243    0.4194303
                                 0.7751902    0.4233468
                                 0.7801424    0.4216974
                                 0.7864508    0.4307934
  2    0.7718671    0.4196500
                                 0.7718850    0.4196728
                                 0.7719317    0.4196813
                                 0.7720109    0.4197462
  3    0.7718445    0.4196434
                                 0.7718445    0.4196434

The fact that $\mathbf{x}_1 = \mathbf{x}_0$ for k = 3 indicates that convergence has been
accomplished. It is evident in this example that the sequences $\{\mathbf{x}_n\}$ would
not converge, as they begin to diverge for each k already for small values
of n.
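
Algorithm 5.9 for two components can be sketched as follows. The helper names (`solve2`, `diff`, `steffensen_step`, `steffensen`) and the stopping rule based on the size of f(x) - x are assumptions made for this illustration; the text itself leaves open what to do when $\Delta^2\mathbf{X}_0$ is singular, and the sketch simply raises an error in that case.

```python
def solve2(M, r):
    """Solve the 2x2 linear system M w = r by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("matrix is singular")
    return ((r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det)

def diff(p, q):
    """Componentwise difference p - q of two pairs."""
    return (p[0] - q[0], p[1] - q[1])

def steffensen_step(f, x0):
    """Return x'_0 of formula (5-35), built from x_0 and three further iterates."""
    x1 = f(x0)
    x2 = f(x1)
    x3 = f(x2)
    d0, d1, d2 = diff(x1, x0), diff(x2, x1), diff(x3, x2)   # first differences
    e0, e1 = diff(d1, d0), diff(d2, d1)                     # columns of the second-difference matrix
    w = solve2(((e0[0], e1[0]), (e0[1], e1[1])), d0)
    corr = (d0[0] * w[0] + d1[0] * w[1],                    # first-difference matrix times w
            d0[1] * w[0] + d1[1] * w[1])
    return diff(x0, corr)

def steffensen(f, x, tol=1e-12, max_outer=20):
    """Repeat the outer step until f(x) - x is negligible."""
    for _ in range(max_outer):
        if sum(abs(t) for t in diff(f(x), x)) < tol:
            break
        x = steffensen_step(f, x)
    return x

f = lambda p: (p[0] ** 2 + p[1] ** 2, p[0] ** 2 - p[1] ** 2)
print(steffensen_step(f, (0.8, 0.4)))    # compare with row k = 1 of table 5.9
print(steffensen(f, (0.8, 0.4)))
```

The early exit also avoids forming second differences of iterates that already agree to rounding error, where the matrix would be numerically singular.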


Problems 


12. Assuming (5-32) to be exact, show that applying Aitken’s formula (4-11)
individually to each component of x does, in general, not produce the
correct solution vector s.

13*. Assuming (5-32) to be exact, show that the matrix $\Delta^2\mathbf{X}_n$ is singular if and
only if at least one of the following situations obtains:

(i) $\mathbf{x}_n = \mathbf{s}$;
(ii) $\mathbf{x}_n - \mathbf{s}$ is an eigenvector of J;
(iii) the matrix J is similar to a matrix cI, where c is real.

Show that in the cases (ii) and (iii) the procedure described in problem 12
is effective.

14. Find the solution of the system considered in example 2 by means of
algorithm 5.9.

Recommended Reading


The abstract background of the method of iteration is discussed in most 
textbooks on functional analysis; see, for example, Liusternik and 
Sobolev [1961], p. 27. A beautiful discussion of Newton’s method for 
systems of equations is given by Kantorovich [1948] (see also Henrici 
[1962], pp. 367-371). 


Research Problems 


1. Formulate Newton’s method for systems of equations with an 
arbitrary number of unknowns. 
2. Generalize Bairstow’s method to extract from a polynomial of degree 
N a factor of arbitrary degree n < N. 
3. Discuss Newton’s method in the “singular” case when the determinant
(5-19) is zero at the solution (s, t).
4. Discuss experimentally the stability of algorithm 5.9 when the matrix
$\Delta^2\mathbf{X}_n$ is nearly singular.
5. Develop a theory of iteration for systems if the vectors $\mathbf{x}_n$ are formed
according to

    $x_{n+1} = f(x_n, y_n),$
    $y_{n+1} = g(x_{n+1}, y_n)$

(i.e., the most recent information is used at each step). What is the
correct formulation of the $\Delta^2$ procedure in this case?


chapter 6 linear difference equations 


Linear difference equations were first encountered in §3.3, where the main 
topic was the study of linear difference equations of the first order. In 
the present chapter, we shall consider linear difference equations of 
arbitrary order. The theory of such difference equations is required for 
the understanding of some important algorithms to be discussed in the 
chapters 7 and 8. Furthermore, linear difference equations are a useful 
tool in the study of many other processes of numerical analysis; see in 
particular the chapters 14 and 16. Much unnecessary complication is 
avoided by considering difference equations with complex coefficients and 
solutions. 


6.1 Notation 


We recall that a linear difference equation of order N has the form

(6-1)    $a_{0,n} x_n + a_{1,n} x_{n-1} + \cdots + a_{N,n} x_{n-N} = b_n.$

Here $\{a_{0,n}\}, \{a_{1,n}\}, \ldots, \{a_{N,n}\}$ and $\{b_n\}$ are given sequences, and $\{x_n\}$ is a
sequence to be determined. The difference equation (6-1) is called
homogeneous if $\{b_n\}$ is the zero sequence, i.e., if all its elements are zero.

Although much of the subsequent theory can easily be extended to the
general case, we shall be concerned exclusively with the case of linear
difference equations with constant coefficients. In this case, $a_{k,n} = a_k$ for
all values of n and for certain (real or complex) constants $a_0, a_1, \ldots, a_N$.
(It is not required that $b_n = b$, however.) Without loss of generality we
may assume that $a_0 \ne 0$, $a_N \ne 0$, for otherwise the order of the difference
equation could be reduced. Dividing through by $a_0$ and renaming the
constants and the elements of the sequence $\{b_n\}$, the linear difference
equation with constant coefficients appears in the form

(6-2)    $x_n + a_1 x_{n-1} + a_2 x_{n-2} + \cdots + a_N x_{n-N} = b_n,$

where $a_N \ne 0$.

We can aid the understanding by two notational simplifications. First,
we shall no longer refer to a sequence by explicitly exhibiting its elements,
as in the symbol $\{x_n\}$, but instead denote a sequence by a single capital
letter. The nth element of a sequence X will be denoted by $x_n$, or sometimes
also by $(X)_n$. Secondly, if $X = \{x_n\}$ is any sequence, we shall denote
by $\mathcal{L}X$ the sequence whose nth element is given by

    $(\mathcal{L}X)_n = x_n + a_1 x_{n-1} + a_2 x_{n-2} + \cdots + a_N x_{n-N}.$

With these notations, the problem of solving the difference equation (6-2)
is the same as the problem of finding a sequence X such that

(6-3)    $\mathcal{L}X = B,$

where B denotes the sequence of the nonhomogeneous terms $b_n$.
We note that the operator $\mathcal{L}$ defined above is a linear operator. Defining
the product aX of a scalar a and of a sequence X by

    $(aX)_n = a\,(X)_n$

and the sum of two sequences X and Y by

    $(X + Y)_n = (X)_n + (Y)_n$

we have for arbitrary scalars a and b and for any two sequences X and Y

(6-4)    $\mathcal{L}(aX + bY) = a\,\mathcal{L}X + b\,\mathcal{L}Y.$

The comprehension of the above notation may be helped by considering
analogous simplifications in the theory of functions. Writing X in place
of $\{x_n\}$ is much like writing f in place of f(x) to denote a function. The
operator $\mathcal{L}$ plays a role similar to that, say, of the differentiation operator
D, which associates with a function f its derivative Df = f'. Relation (6-4)
is the analog of the familiar fact that differentiation is a linear operation,
i.e., that D(af + bg) = aDf + bDg.


6.2 Particular Solutions of the Homogeneous Equation of Order Two 


We shall consider in some detail the case N = 2. Here we have, for
some constants $a_1$ and $a_2 \ne 0$,

(6-5)    $(\mathcal{L}X)_n = x_n + a_1 x_{n-1} + a_2 x_{n-2}.$

Our first task is to find solutions of the homogeneous equation $\mathcal{L}X = 0$.†

† Here the symbol 0 does not denote the scalar zero, but rather the sequence whose
elements are all zero. Since no misunderstanding is possible, we shall not attempt
to make a graphical distinction between the two concepts.

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Let us take a look at the corresponding problem in the theory of 
differential equations. The linear homogeneous differential equation of 
order 2 with constant coefficients, 


“" , , dx 
x” + ayx' + agx =0 (x = 5) 
has, for suitably chosen r, solutions of the form e”. Is the same true also 
for the difference equation 


(6-6) Xn + QAyXn-1 + AgXn-2 = 0? 


Replacing ¢ by the discrete variable n, we are tempted to seek solutions of 
the form x, = e = (e’)", or, putting e’ = z, of the form x, = z*. With 
this definition of X = {x,} we have 


(ZX), = 2" + az"! + anz*-? 
= 2"~2(7? + ayz + ao). 


This expression is 0 not only if $z = 0$ (which would yield the trivial
solution of the difference equation) but also if $z$ is a zero of the polynomial
$p$ defined by


(6-7)    $p(z) = z^2 + a_1 z + a_2.$


This polynomial is called the characteristic polynomial of the difference
equation (6-6). We know from the theory of equations that $p$ has exactly
two zeros, $z_1$ and $z_2$, which may be real or complex, and which may
coincide. If $z_1 \neq z_2$, then the sequences with elements $z_1^n$ and $z_2^n$ represent
two distinct, nonzero solutions of $\mathcal{L}X = 0$.

We can also find two distinct solutions if $z_1 = z_2$. We recall that if $z_1$
is a zero of multiplicity $> 1$ of $p$, then by theorem 2.6 not only $p(z_1) = 0$
but also $p'(z_1) = 0$. This suggests finding a second solution by differentiation.
If $(X)_n = z^n$, we have identically in $z$


$(\mathcal{L}X)_n = z^n + a_1 z^{n-1} + a_2 z^{n-2} = z^{n-2} p(z).$

Differentiating with respect to $z$, we get

$n z^{n-1} + a_1 (n-1) z^{n-2} + a_2 (n-2) z^{n-3} = (n-2) z^{n-3} p(z) + z^{n-2} p'(z).$

For $z = z_1$ the expression on the right is zero for all $n$, showing that the
sequence with nth element $n z_1^{n-1}$ is also a solution. Since $z_1 \neq 0$
(because $a_2 \neq 0$), the same is true of its multiple by $z_1$, the sequence with
nth element $n z_1^n$. We thus have obtained:


Theorem 6.2 Let $a_1$ and $a_2 \neq 0$ be constants, and consider the
difference equation $\mathcal{L}X = 0$, where $\mathcal{L}$ is defined by (6-5). If $z_1$, $z_2$
are two distinct zeros of the characteristic polynomial, then the two
sequences $X^{(1)}$, $X^{(2)}$ defined by

$(X^{(1)})_n = z_1^n, \qquad (X^{(2)})_n = z_2^n$

are solutions of $\mathcal{L}X = 0$. If $z_1$ is a zero of multiplicity 2, then the
sequences defined by

$(X^{(1)})_n = z_1^n, \qquad (X^{(2)})_n = n z_1^n$

are solutions.
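Theorem 6.2 can be tried out on a computer. The sketch below is illustrative only: the helper names `char_zeros` and `residual` are assumptions, and the two test equations were chosen for convenience.

```python
import cmath


def char_zeros(a1, a2):
    """Zeros of p(z) = z^2 + a1*z + a2, by the quadratic formula."""
    root = cmath.sqrt(a1 * a1 - 4 * a2)
    return (-a1 + root) / 2, (-a1 - root) / 2


def residual(a1, a2, x, n):
    """(LX)_n = x_n + a1*x_{n-1} + a2*x_{n-2} for a sequence given as a function of n."""
    return x(n) + a1 * x(n - 1) + a2 * x(n - 2)


# Distinct zeros: x_n - x_{n-1} - x_{n-2} = 0, i.e. a1 = a2 = -1.
z1, z2 = char_zeros(-1, -1)
for n in range(2, 12):
    assert abs(residual(-1, -1, lambda k: z1 ** k, n)) < 1e-9
    assert abs(residual(-1, -1, lambda k: z2 ** k, n)) < 1e-9

# Double zero z = 2: p(z) = (z - 2)^2 = z^2 - 4z + 4.
for n in range(2, 12):
    assert residual(-4, 4, lambda k: 2 ** k, n) == 0
    assert residual(-4, 4, lambda k: k * 2 ** k, n) == 0
```

In the first case the zeros are irrational, so the residual is only compared against a small tolerance; in the second case everything is an integer and the residual vanishes exactly.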


EXAMPLES 
1. We consider the difference equation $x_n = x_{n-1} + x_{n-2}$. The
characteristic polynomial $p(z) = z^2 - z - 1$ has the two zeros

$z_1 = \frac{1 + \sqrt{5}}{2}, \qquad z_2 = \frac{1 - \sqrt{5}}{2};$

thus two solutions are given by

$(X^{(1)})_n = \left(\frac{1 + \sqrt{5}}{2}\right)^n, \qquad (X^{(2)})_n = \left(\frac{1 - \sqrt{5}}{2}\right)^n.$

2. Consider $x_n - 2x_{n-1} + x_{n-2} = 0$. The characteristic polynomial
is $p(z) = z^2 - 2z + 1$; $z_1 = 1$ is a double zero. Thus we find the two
solutions

$(X^{(1)})_n = 1^n = 1, \qquad (X^{(2)})_n = n \cdot 1^n = n,$


as can be verified directly. 


6.3. The General Solution 


Throughout this section, $\mathcal{L}$ will denote the difference operator defined
by (6-5), where $a_2 \neq 0$. The first tool is the following:


Lemma 6.3a If $X^{(1)}$ and $X^{(2)}$ are any two solutions of $\mathcal{L}X = 0$, and
if $c_1$ and $c_2$ are any two constants, then the sequence

$X = c_1 X^{(1)} + c_2 X^{(2)}$

is also a solution.

Proof. By the linearity of the operator $\mathcal{L}$,

$\mathcal{L}X = \mathcal{L}(c_1 X^{(1)} + c_2 X^{(2)}) = c_1 \mathcal{L}X^{(1)} + c_2 \mathcal{L}X^{(2)} = 0.$


As an exercise in the application of lemma 6.3a, we consider the problem
of finding real solutions of real difference equations. Let $\mathcal{L}$ be defined
as above, where $a_1$ and $a_2$ are real. The characteristic polynomial


$p(z) = z^2 + a_1 z + a_2,$


although a polynomial with real coefficients, may still have nonreal zeros
$z_1$, $z_2$. However, these zeros according to theorem 2.5d then are complex
conjugate, i.e.,

$z_2 = \bar{z}_1,$

or, denoting by $\operatorname{Re} a$ and $\operatorname{Im} a$ the real and imaginary parts of a complex
number $a$,

$\operatorname{Re} z_1 = \operatorname{Re} z_2, \qquad \operatorname{Im} z_1 = -\operatorname{Im} z_2.$

If the two zeros are nonreal, two distinct solutions are given by

$(X^{(1)})_n = z_1^n, \qquad (X^{(2)})_n = (\bar{z}_1)^n.$

Since the product of complex conjugate numbers is equal to the complex
conjugate number of the product, we can also write

$(X^{(2)})_n = \overline{z_1^n}.$

It now follows from lemma 6.3a that the two real sequences $Y^{(1)}$, $Y^{(2)}$
defined by

$(Y^{(1)})_n = \tfrac{1}{2}\left[z_1^n + \overline{z_1^n}\right] = \operatorname{Re} z_1^n,$
$(Y^{(2)})_n = \tfrac{1}{2i}\left[z_1^n - \overline{z_1^n}\right] = \operatorname{Im} z_1^n,$

are likewise solutions of the difference equation.


EXAMPLE 
3. Let $-1 < t < 1$, and consider the difference equation


(6-8)    $x_n - 2t x_{n-1} + x_{n-2} = 0.$

The characteristic polynomial $p(z) = z^2 - 2tz + 1$ has the zeros

$z_1 = t + i\sqrt{1 - t^2}, \qquad z_2 = t - i\sqrt{1 - t^2}.$

If we define the angle $\varphi$ by the condition

$\cos\varphi = t \qquad (0 < \varphi < \pi),$

then the zeros appear in the form

$z_1 = \cos\varphi + i\sin\varphi = e^{i\varphi},$
$z_2 = \cos\varphi - i\sin\varphi = e^{-i\varphi}.$

Consequently, the two solutions given by theorem 6.2 appear in the form

$(X^{(1)})_n = (e^{i\varphi})^n = e^{in\varphi},$
$(X^{(2)})_n = (e^{-i\varphi})^n = e^{-in\varphi},$

and are complex. However, in view of $e^{in\varphi} = \cos n\varphi + i\sin n\varphi$, also the
sequences defined by

$(Y^{(1)})_n = \cos n\varphi, \qquad (Y^{(2)})_n = \sin n\varphi$

are solutions. This may be verified directly by using the addition
theorems of the trigonometric functions.

It is readily seen that the elements of the sequence $Y^{(1)}$ are polynomials
of degree $n$ in the variable $t = \cos\varphi$. In fact, $(Y^{(1)})_0 = 1$, $(Y^{(1)})_1 = t$,
and (6-8) shows that if this property holds for the integers $n - 2$ and $n - 1$,
it holds for the integer $n$. Conventionally, these polynomials are denoted
by $T_n(t)$ and are called Chebyshev polynomials of degree $n$. By the above
we can also write

$T_n(t) = \cos(n \arccos t).$


The Chebyshev polynomials are important in many branches of numerical 
analysis (see §9.4). 
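The two descriptions of the Chebyshev polynomials, by the recurrence (6-8) and by the cosine formula, can be compared numerically. This is a minimal sketch; the function name and the sample points are assumptions made for the illustration.

```python
import math


def chebyshev_by_recurrence(n, t):
    """T_n(t) from x_k = 2*t*x_{k-1} - x_{k-2}, started with T_0 = 1 and T_1 = t."""
    prev, curr = 1.0, t
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, curr = curr, 2 * t * curr - prev
    return curr


for t in (-0.9, -0.3, 0.0, 0.5, 0.99):
    for n in range(8):
        closed = math.cos(n * math.acos(t))
        assert abs(chebyshev_by_recurrence(n, t) - closed) < 1e-9
```

The recurrence needs only multiplications and subtractions, which is one reason it is often preferred to the trigonometric form when many values of $T_n(t)$ are wanted.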


We return to the problem of finding all solutions of $\mathcal{L}X = 0$. Our
second tool is the following simple observation.


Lemma 6.3b Let $I$ be a set of consecutive integers, and let the integers
$m$ and $m - 1$ be in $I$. Let the sequence $B = \{b_n\}$ be defined on $I$.
Then the difference equation $\mathcal{L}X = B$ has precisely one solution
which assumes given values for $n = m$ and $n = m - 1$.


Proof. Let $X^{(1)}$ and $X^{(2)}$ be two solutions having the same values at
$n = m$ and $n = m - 1$. The sequence $D = X^{(1)} - X^{(2)} = \{d_n\}$ then is a
solution of $\mathcal{L}X = 0$ assuming the values 0 for $n = m$ and $n = m - 1$.
If $X^{(1)}$ and $X^{(2)}$ are not identical, then $d_n \neq 0$ for some $n > m$ or some
$n < m - 1$. To fix ideas, let $d_n \neq 0$ for some $n > m$. If we denote by
$n$ the smallest integer for which this is the case, then, because $D$ is a
solution,

$d_n + a_1 d_{n-1} + a_2 d_{n-2} = 0.$

However, since $d_{n-1} = d_{n-2} = 0$, $d_n \neq 0$, this equation reveals a contradiction.
The case where some $d_n \neq 0$ with $n < m - 1$ is dealt with
similarly, making use of the fact that $a_2 \neq 0$. It follows that $D$ is the
zero sequence, and hence that the sequences $X^{(1)}$ and $X^{(2)}$ are identical.
The zero sequence is a solution of $\mathcal{L}X = 0$ with the property that its
elements are zero for any two consecutive values of $n$. Lemma 6.3b
shows that this property characterizes the zero solution. In other
words: If a solution of $\mathcal{L}X = 0$ vanishes at two consecutive integers, it
vanishes identically.

Let now $X^{(1)} = \{x_n^{(1)}\}$ and $X^{(2)} = \{x_n^{(2)}\}$ be two known solutions of
$\mathcal{L}X = 0$, defined for $-\infty < n < \infty$, and let $X = \{x_n\}$ be an arbitrary
third solution. Is it possible to find constants $c_1$ and $c_2$ such that

$X = c_1 X^{(1)} + c_2 X^{(2)}$?

If that is the case, then we are entitled to call the sequence $c_1 X^{(1)} + c_2 X^{(2)}$
the general solution of $\mathcal{L}X = 0$, for any special solution can be obtained
by assigning special values to the constants $c_1$ and $c_2$.

By lemma 6.3a, the sequence $c_1 X^{(1)} + c_2 X^{(2)}$ is in any case a solution of
$\mathcal{L}X = 0$. By lemma 6.3b, this solution is identical with $X$ if it agrees
with $X$ at two consecutive integers, $n$ and $n - 1$, say. In order to obtain
this agreement, we must be able to determine the constants $c_1$ and $c_2$ such
that the two equations


(6-9)    $c_1 x_n^{(1)} + c_2 x_n^{(2)} = x_n, \qquad c_1 x_{n-1}^{(1)} + c_2 x_{n-1}^{(2)} = x_{n-1},$


are satisfied. By the theory of linear equations, this system has a solution
for arbitrary $x_n$ and $x_{n-1}$ if its determinant

$w_n = \begin{vmatrix} x_n^{(1)} & x_n^{(2)} \\ x_{n-1}^{(1)} & x_{n-1}^{(2)} \end{vmatrix}$

is different from zero.


The determinant $w_n$ is called the Wronskian determinant of the sequences
$X^{(1)}$ and $X^{(2)}$ at the point $n$. We have obtained


Theorem 6.3 Let $X^{(1)}$ and $X^{(2)}$ be two particular solutions of
$\mathcal{L}X = 0$. Then every solution of $\mathcal{L}X = 0$ can be expressed in the
form $c_1 X^{(1)} + c_2 X^{(2)}$ if and only if the Wronskian determinant $w_n$
of $X^{(1)}$ and $X^{(2)}$ is different from zero for all values of $n$.


The requirement that w, # 0 for all n is less stringent than it appears, 
as will be seen presently. 


EXAMPLES 
4. Let the two zeros $z_1$ and $z_2$ of the characteristic polynomial $p$ of
$\mathcal{L}X = 0$ be different. We calculate the Wronskian of the two solutions

$x_n^{(1)} = z_1^n, \qquad x_n^{(2)} = z_2^n$

given by theorem 6.2. We find

$w_n = \begin{vmatrix} z_1^n & z_2^n \\ z_1^{n-1} & z_2^{n-1} \end{vmatrix} = (z_1 z_2)^{n-1}(z_1 - z_2).$

Evidently $w_n \neq 0$ for all $n$. Hence the general solution of the difference
equation is given by $c_1 z_1^n + c_2 z_2^n$.

5. If $z_1$ is a zero of multiplicity 2 of the characteristic polynomial, two
solutions of the difference equation are given by

$x_n^{(1)} = z_1^n, \qquad x_n^{(2)} = n z_1^n.$

We find

$w_n = \begin{vmatrix} z_1^n & n z_1^n \\ z_1^{n-1} & (n-1) z_1^{n-1} \end{vmatrix} = -z_1^{2n-1} \neq 0;$

hence

$x_n = c_1 z_1^n + c_2 n z_1^n$

is the general solution.

6. As a somewhat more special example, we consider the real solutions
of the difference equation

$x_n - 2t x_{n-1} + x_{n-2} = 0 \qquad (-1 < t < 1)$

found in example 3. If $t = \cos\varphi$, we were able to express these solutions
in the form $x_n^{(1)} = \cos n\varphi$, $x_n^{(2)} = \sin n\varphi$. For these solutions,

$w_n = \begin{vmatrix} \cos n\varphi & \sin n\varphi \\ \cos (n-1)\varphi & \sin (n-1)\varphi \end{vmatrix}$
$\phantom{w_n} = \cos n\varphi \, \sin (n-1)\varphi - \cos (n-1)\varphi \, \sin n\varphi$
$\phantom{w_n} = -\sin\varphi \neq 0.$

Thus the general solution can be expressed in the form

$c_1 \cos n\varphi + c_2 \sin n\varphi.$

(Note, however, the restriction on $t$.)


We are now in a position to solve arbitrary initial value problems for
the linear difference equation $\mathcal{L}X = 0$. All we have to do is to find two
particular solutions $X^{(1)}$, $X^{(2)}$ with nonvanishing Wronskian and to
determine the constants $c_1$ and $c_2$ such that the sequence $X = c_1 X^{(1)}
+ c_2 X^{(2)}$ satisfies the given initial conditions.


EXAMPLE 
7. Let us find the solution of

$x_n = x_{n-1} + x_{n-2}$

satisfying the conditions $x_{-1} = 0$, $x_0 = 1$. By the examples 1 and 4,
the general solution of the difference equation is

$x_n = c_1 \left(\frac{1 + \sqrt{5}}{2}\right)^n + c_2 \left(\frac{1 - \sqrt{5}}{2}\right)^n.$

The initial conditions yield the following conditions on $c_1$ and $c_2$:

$c_1 + c_2 = 1,$
$c_1 \left(\frac{1 + \sqrt{5}}{2}\right)^{-1} + c_2 \left(\frac{1 - \sqrt{5}}{2}\right)^{-1} = 0.$

We easily find the solution

$c_1 = \frac{1 + \sqrt{5}}{2\sqrt{5}}, \qquad c_2 = -\frac{1 - \sqrt{5}}{2\sqrt{5}};$

thus the solution of our initial value problem is given by the formula

$x_n = \frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \right].$

It is not immediately evident from this representation that the numbers $x_n$
all are integers. The sequence thus defined is called the Fibonacci
sequence. It has many interesting number-theoretical properties.
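That the closed formula of example 7 really produces the integers generated by the recurrence can be checked directly. The following is a small sketch under the indexing of example 7 ($x_{-1} = 0$, $x_0 = 1$); the function names are illustrative.

```python
import math


def fib_recurrence(n):
    """x_n = x_{n-1} + x_{n-2} with x_{-1} = 0, x_0 = 1, for n >= 0."""
    prev, curr = 0, 1                 # x_{-1}, x_0
    for _ in range(n):
        prev, curr = curr, prev + curr
    return curr


def fib_closed_form(n):
    """x_n = (z1^(n+1) - z2^(n+1)) / sqrt(5), evaluated in floating point."""
    s5 = math.sqrt(5.0)
    z1, z2 = (1 + s5) / 2, (1 - s5) / 2
    return (z1 ** (n + 1) - z2 ** (n + 1)) / s5


for n in range(30):
    # The closed form is only approximate in floating point, hence the rounding.
    assert round(fib_closed_form(n)) == fib_recurrence(n)
```

For large $n$ the floating-point evaluation of the closed form eventually loses accuracy, whereas the integer recurrence remains exact.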


Problems 


1. Express in real terms a general solution of the following difference
equations:

(a) $x_n - x_{n-2} = 0$;
(b) $x_n + x_{n-2} = 0$;
(c) $x_n + 2x_{n-1} + x_{n-2} = 0$.


2. Find a general solution of the difference equation 
Xn + 21X,e4 + Xn-2 = 0. 


3. Let $a$, $b$ be real, $a^2 - 4b < 0$. Show that the general solution of the
difference equation

$x_n + a x_{n-1} + b x_{n-2} = 0$

can be expressed in the form

$x_n = r^n (c_1 \cos n\varphi + c_2 \sin n\varphi),$

where $r$ and $\varphi$ are real.
4. Let X = {x,} denote the Fibonacci sequence. Show that 
naw © Xn Zz 


5*. Let $b$, $c$ be real. Show that a necessary and sufficient condition in order
that all solutions of the difference equation

$x_n + 2b x_{n-1} + c x_{n-2} = 0$

tend to zero for $n \to \infty$ is as follows: The point $(b, c)$ lies in the interior
of the triangular region of the $(b, c)$ plane bounded by the straight lines

$c = 1, \qquad 2b - 1 = c, \qquad -2b - 1 = c$

(see Fig. 6.3). (Hint: Treat separately the two cases where the zeros of
the characteristic polynomial are real or imaginary.)

Figure 6.3

6*. Formulate and prove results analogous to those of §6.3 for linear difference 
equations with variable coefficients. 

7. Solve the following initial value problems for difference equations:

(a) Xn = 3xXn-1 —Xn-2, X= 1, x1 = 2; 
(b) Xn — 2Xn-1 + 2Xn-2, X6.= X= 1, 


6.4 Linear Dependence 

The results of §6.3 can be formulated somewhat more elegantly by
introducing the concepts of linear dependence and linear independence
of sequences. Two sequences $X^{(1)}$ and $X^{(2)}$ defined on a set $I$ of integers
(but not necessarily solutions of a difference equation such as $\mathcal{L}X = 0$)
are called linearly dependent if there exist two constants $c_1$, $c_2$, not both
zero, such that the sequence

$X = c_1 X^{(1)} + c_2 X^{(2)}$

is the zero sequence on $I$. If no such constants exist, the sequences are
called linearly independent.
EXAMPLES 
8. Let $I$ be the set of integers from 1 to 20 and let

$X^{(1)} = \{1, 1, \ldots, 1, 0, 0, \ldots, 0\}$ (10 elements equal to 1, then 10 elements equal to 0),
$X^{(2)} = \{0, 0, \ldots, 0, 1, 1, \ldots, 1\}$ (10 elements equal to 0, then 10 elements equal to 1).

The sequences $X^{(1)}$ and $X^{(2)}$ are independent. For assume that

$c_1 x_n^{(1)} + c_2 x_n^{(2)} = 0$

for $n = 1, \ldots, 20$. For $1 \le n \le 10$ this implies $c_1 = 0$, for $11 \le n \le 20$
it implies $c_2 = 0$. Thus $c_1 X^{(1)} + c_2 X^{(2)} = 0$ is possible only for
$c_1 = c_2 = 0$.

9. If one of the sequences $X^{(1)}$ and $X^{(2)}$ is the zero sequence, the two
sequences are always linearly dependent.


To find out whether two given sequences are independent may be
difficult in general. However, if the two sequences are both solutions of
the same linear difference equation $\mathcal{L}X = 0$, their linear dependence or
independence is closely connected with their Wronskian determinants.

Theorem 6.4 Let $X^{(1)}$ and $X^{(2)}$ be two solutions of $\mathcal{L}X = 0$, and
let $W = \{w_n\}$ be the sequence of their Wronskian determinants.

(i) If $X^{(1)}$ and $X^{(2)}$ are linearly dependent, then $W$ is the zero
sequence.

(ii) If $w_m = 0$ for some integer $m$, then the two solutions are linearly
dependent.


An immediate consequence of theorem 6.4 is the following


Corollary 6.4 If $w_m \neq 0$ for some integer $m$, then $w_n \neq 0$ for all $n$,
and the solutions $X^{(1)}$ and $X^{(2)}$ are linearly independent.


Proof of theorem 6.4. (i) If $X^{(1)}$ and $X^{(2)}$ are linearly dependent, then
there exist, by definition, two constants $c_1$ and $c_2$, not both zero, so that

$c_1 X^{(1)} + c_2 X^{(2)} = 0.$

In particular, for an arbitrary integer $n$,

$c_1 x_n^{(1)} + c_2 x_n^{(2)} = 0,$
$c_1 x_{n-1}^{(1)} + c_2 x_{n-1}^{(2)} = 0.$

This is a homogeneous linear system of two equations with the nontrivial
solution $(c_1, c_2)$. Its determinant must therefore vanish, showing that
$w_n = 0$, as required.

(ii) Let $m$ be an integer such that $w_m = 0$. The homogeneous system of
two linear equations with two unknowns,

$c_1 x_m^{(1)} + c_2 x_m^{(2)} = 0,$
$c_1 x_{m-1}^{(1)} + c_2 x_{m-1}^{(2)} = 0,$

then has a nontrivial solution $(c_1, c_2)$. We define the sequence

$X = c_1 X^{(1)} + c_2 X^{(2)}.$

By lemma 6.3a, $X$ is a solution of $\mathcal{L}X = 0$. By construction, this
solution is zero at the two consecutive points $n = m$ and $n = m - 1$.
Hence, by lemma 6.3b, $X$ is the zero sequence. It follows that $X^{(1)}$ and
$X^{(2)}$ are linearly dependent.

If $A$ and $B$ are any two sequences, and if $c_1$, $c_2$ are any two constants,
the sequence $c_1 A + c_2 B$ is called a linear combination of the sequences $A$
and $B$. Using the concept of linear dependence, the content of theorem
6.3 can now be phrased more elegantly as follows: Every solution of
$\mathcal{L}X = 0$ can be expressed as a linear combination of two fixed, linearly
independent solutions.
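The alternative stated in theorem 6.4 and its corollary (the Wronskian sequence is either identically zero or never zero) can be observed on a concrete equation. The sketch below uses hypothetical coefficients and helper names; it also checks the first-order relation $w_n = a_2 w_{n-1}$, which follows by substituting the difference equation into the first row of the determinant.

```python
def solve_forward(a1, a2, x_prev, x_curr, steps):
    """Extend [x_{-1}, x_0] by x_n = -a1*x_{n-1} - a2*x_{n-2} for n = 1, ..., steps."""
    xs = [x_prev, x_curr]
    for _ in range(steps):
        xs.append(-a1 * xs[-1] - a2 * xs[-2])
    return xs


def wronskians(xs1, xs2):
    """w_n = x_n^(1) x_{n-1}^(2) - x_n^(2) x_{n-1}^(1) at every stored index with a predecessor."""
    return [xs1[k] * xs2[k - 1] - xs2[k] * xs1[k - 1] for k in range(1, len(xs1))]


a1, a2 = -3, 2                                   # hypothetical coefficients, a2 != 0
first = solve_forward(a1, a2, 1, 0, 8)           # solution with x_{-1} = 1, x_0 = 0
second = solve_forward(a1, a2, 0, 1, 8)          # solution with x_{-1} = 0, x_0 = 1

w = wronskians(first, second)
assert all(value != 0 for value in w)            # independent solutions
assert all(w[k] == a2 * w[k - 1] for k in range(1, len(w)))

tripled = [3 * value for value in first]         # a dependent pair
assert all(value == 0 for value in wronskians(first, tripled))
```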


Problems 


8. Show that if two sequences are linearly dependent, one is a constant 
multiple of the other. 

9. Let $W = \{w_n\}$ be the sequence of Wronskian determinants of two
solutions of $\mathcal{L}X = 0$. Show that $W$ satisfies the first order difference
equation

$w_n = a_2 w_{n-1}.$

Hence give an independent proof of the fact that either $W$ is the zero
sequence, or $w_n \neq 0$ for all $n$.

10. Let $X = \{x_n\}$ be a solution of $\mathcal{L}X = 0$ such that $x_n \neq 0$ for all $n$.
Show that any solution $Y = \{y_n\}$ of the linear difference equation of
order 1 with variable coefficients,


Xn 
Xn-1 





Va = Yn-1 + a3 


satisfies $\mathcal{L}Y = 0$, and that the solutions $X$ and $Y$ are linearly
independent.

11*. Extend the results of §6.4 to linear difference equations with variable 
coefficients. 


6.5 The Non-homogeneous Equation of Order Two 


We now shall discuss the solution of the equation $\mathcal{L}X = B$ or, more
explicitly,

(6-10)    $x_n + a_1 x_{n-1} + a_2 x_{n-2} = b_n,$

where the sequence $B = \{b_n\}$ is defined on some set $I$ of consecutive
integers. Much of the corresponding theory for differential equations
again carries over. For instance we have the following result.


Theorem 6.5 Let $Y$ be a special solution of $\mathcal{L}Y = B$, and let $X^{(1)}$
and $X^{(2)}$ be two linearly independent solutions of the corresponding
homogeneous equation $\mathcal{L}X = 0$. Then every solution $X$ of $\mathcal{L}X = B$
can be expressed in the form

$X = c_1 X^{(1)} + c_2 X^{(2)} + Y,$

where $c_1$, $c_2$ are suitable constants.

Proof. By hypothesis, if $Y = \{y_n\}$,

(6-11)    $y_n + a_1 y_{n-1} + a_2 y_{n-2} = b_n$

for all $n \in I$. If $X = \{x_n\}$ is any solution of (6-10), we obtain by
subtracting (6-11) from (6-10) and setting $D = \{d_n\} = X - Y$,

$d_n + a_1 d_{n-1} + a_2 d_{n-2} = 0.$

Thus, the sequence $D$ is a solution of the homogeneous equation. By
theorem 6.3 it can be written in the form

$D = c_1 X^{(1)} + c_2 X^{(2)}$

for suitable $c_1$, $c_2$. Since $X = Y + D$, the desired result follows.

Theorem 6.5 reduces the problem of finding the general solution of the
nonhomogeneous equation to the problem of finding one particular
solution of it. As in the case of differential equations, such a particular
solution can frequently be found by an inspired guess. For instance, if
the elements of $B$ are $b q^n$, where $b$ and $q \neq 0$ are constants, a special
solution of $\mathcal{L}X = B$ can frequently be found in the form $x_n = a q^n$, where
$a$ is a constant to be determined. Substituting into (6-10) yields

$a(q^n + a_1 q^{n-1} + a_2 q^{n-2}) = b q^n,$

or, denoting by $p$ the characteristic polynomial of the homogeneous
equation and dividing by $q^n$,

$a q^{-2} p(q) = b.$

If $p(q) \neq 0$, we find

$a = \frac{q^2 b}{p(q)};$

the method breaks down, however, if $q$ is a zero of the characteristic
polynomial.

Similarly, if the elements of B are polynomials in n, there often exist 
solutions X whose elements are polynomials of the same degree in n. 
These solutions can be found by the method of undetermined coefficients. 
Again the method may break down in exceptional cases. 
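The guess $x_n = a q^n$ for a right-hand side $b q^n$ can be turned into a few lines of code. This is a hedged sketch: the function name `geometric_particular` and the sample equation are assumptions, and exact rational arithmetic is used so that the check is not disturbed by rounding.

```python
from fractions import Fraction


def geometric_particular(a1, a2, b, q):
    """Coefficient a such that x_n = a*q^n solves x_n + a1*x_{n-1} + a2*x_{n-2} = b*q^n."""
    q = Fraction(q)
    p_q = q * q + a1 * q + a2                    # characteristic polynomial at q
    if p_q == 0:
        raise ValueError("q is a zero of the characteristic polynomial")
    return Fraction(b) * q * q / p_q             # a = q^2 * b / p(q)


# Hypothetical equation x_n - 5 x_{n-1} + 6 x_{n-2} = 4^n.
a = geometric_particular(-5, 6, 1, 4)
for n in range(2, 10):
    x = lambda k: a * Fraction(4) ** k
    assert x(n) - 5 * x(n - 1) + 6 * x(n - 2) == Fraction(4) ** n
```

Calling the function with $q = 2$ or $q = 3$, the zeros of $z^2 - 5z + 6$, raises the error, which mirrors the breakdown of the method noted in the text.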


EXAMPLE 
10. To find a particular solution of

$x_n - x_{n-1} - x_{n-2} = -n^2.$

We set $x_n = an^2 + bn + c$, with constants $a$, $b$, $c$ to be determined.
Substitution yields

$a[n^2 - (n-1)^2 - (n-2)^2] + b[n - (n-1) - (n-2)] + c[1 - 1 - 1] = -n^2$

or, after simplification,

$a(-n^2 + 6n - 5) + b(-n + 3) + c(-1) = -n^2.$

Comparing coefficients of like powers of $n$, we get

$-a = -1,$
$6a - b = 0,$
$-5a + 3b - c = 0,$

yielding $a = 1$, $b = 6$, $c = 13$. Thus we have found the solution

$x_n = n^2 + 6n + 13.$
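The result of example 10 is easily confirmed by substitution. The short sketch below solves the triangular system for the coefficients and then tests the equation for a range of $n$; the function name is illustrative.

```python
def particular(n):
    """Candidate solution x_n = n^2 + 6n + 13 of x_n - x_{n-1} - x_{n-2} = -n^2."""
    return n * n + 6 * n + 13


# Coefficient comparison, solved from the top equation downward.
a = 1                      # from -a = -1
b = 6 * a                  # from 6a - b = 0
c = -5 * a + 3 * b         # from -5a + 3b - c = 0
assert (a, b, c) == (1, 6, 13)

for n in range(-5, 20):
    assert particular(n) - particular(n - 1) - particular(n - 2) == -n * n
```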


Problems 


12. Find general solutions of the difference equations

(a) $x_n - 5x_{n-1} + 6x_{n-2} = 2$;
(b) $8x_n - 6x_{n-1} + x_{n-2} = 2^n$;
(c) $x_n - x_{n-1} - 2x_{n-2} = 1$.


13. Find the solution of equation (a) above that satisfies the initial condition 
Xo = 1, 14> 1. 
14. Determine a general solution of the difference equation 


Dx = 3Xn-1 = 2X, 25 +3=0. 


What relation must hold between the values $x_0$ and $x_1$ of a particular
solution $\{x_n\}$ of the above equation in order that $|x_n| \le C$ for some
constant $C$ and for all $n > 0$?

15. Difference equations have remarkable applications in the theory of
economics. Let $y_n$ be the national income in the year $n$, $a$ the marginal
propensity to consume, and $b$ the ratio of private investment to increase
in consumption. Assuming that government expenditure is constant
and equal to 1, a certain economic theory states that the following difference
equation for $y_n$ holds (see S. Goldberg [1958], p. 6):

$y_n = a y_{n-1} + ab(y_{n-1} - y_{n-2}) + 1.$
(a) Solve this equation with a = 0.5, b = 1 under the initial condition 
Yioss = 2, Yios7 = 3. 


(b) How frequent are depressions in this economy? 
(c) For what values of $a$, $b$ is the economy thus described noninflationary?


6.6 Variation of Constants


The methods discussed in §6.5 for finding a particular solution of the 
nonhomogeneous equation are, at best, heuristic and furnish a solution 
only in special cases. With differential equations, the method of variation 
of constants permits us to find a solution of any nonhomogeneous equation 
if the general solution of the homogeneous equation is known (see 
Coddington [1962], p. 67). A similar formula exists for difference 
equations. For simplicity we shall consider the case in which the sequence 
B is defined on the set of nonnegative integers. 


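Theorem 6.6 Let $X^{(1)} = \{x_n^{(1)}\}$ and $X^{(2)} = \{x_n^{(2)}\}$ be two linearly
independent solutions of the homogeneous equation $\mathcal{L}X = 0$, and
let $W = \{w_n\}$ be the sequence of their Wronskian determinants. If
$B = \{b_n\}$ is defined on the set of nonnegative integers, a solution $X$ of
$\mathcal{L}X = B$ is given by

(6-12)    $x_n = \sum_{m=0}^{n} \frac{1}{w_m} \begin{vmatrix} x_n^{(1)} & x_n^{(2)} \\ x_{m-1}^{(1)} & x_{m-1}^{(2)} \end{vmatrix} b_m, \qquad n \ge 0.$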


Proof. Formula (6-12) can be verified by direct computation. A different 
proof, which shows how the formula is obtained, is as follows. 

It is clear that, for a given difference operator $\mathcal{L}$, the elements $x_n$ of the
required solution are linear functions of the preceding elements of the
sequence $B$. We may thus write

(6-13)    $x_n = \sum_{m=0}^{n} d_{n,m} b_m,$

where the coefficients $d_{n,m}$ depend only on $\mathcal{L}$ and not on the particular
sequence $B$. The requirement that the sequence $X = \{x_n\}$ defined by
(6-13) satisfies $\mathcal{L}X = B$ leads to

$x_n + a_1 x_{n-1} + a_2 x_{n-2} = \sum_{m=0}^{n} d_{n,m} b_m + a_1 \sum_{m=0}^{n-1} d_{n-1,m} b_m + a_2 \sum_{m=0}^{n-2} d_{n-2,m} b_m = b_n,$

or, collecting factors of $b_m$,

$\sum_{m=0}^{n-2} (d_{n,m} + a_1 d_{n-1,m} + a_2 d_{n-2,m}) b_m + (d_{n,n-1} + a_1 d_{n-1,n-1}) b_{n-1} + d_{n,n} b_n = b_n.$

This identity must hold for all $n \ge 0$, no matter how the sequence $B$ is
chosen. It follows that for each $m$, the coefficients of $b_m$ on both sides
must be equal. On the right, there is a nonzero coefficient only for
$m = n$. This leads to the relations

(6-14a)    $d_{n,m} + a_1 d_{n-1,m} + a_2 d_{n-2,m} = 0, \qquad n > m + 1,$

(6-14b)    $d_{n,n-1} + a_1 d_{n-1,n-1} = 0,$

(6-14c)    $d_{n,n} = 1.$

Formula (6-14a) shows that for a fixed value of $m$, the sequence $D = \{d_{n,m}\}$
is a solution of $\mathcal{L}D = 0$ for $n > m + 1$. Equation (6-14c) yields a first
initial condition $d_{m,m} = 1$. If we impose the further initial condition
$d_{m-1,m} = 0$, then (6-14b) can be replaced by (6-14a), where $n = m + 1$.
The sequence $D$ is thus characterized as solving $\mathcal{L}D = 0$ for $n > m$ and
satisfying the initial conditions

(6-15)    $d_{m-1,m} = 0, \qquad d_{m,m} = 1.$

By theorem 6.3 there must exist constants $c_m^{(1)}$, $c_m^{(2)}$, depending only on $m$,
such that

$d_{n,m} = c_m^{(1)} x_n^{(1)} + c_m^{(2)} x_n^{(2)}.$

The initial conditions (6-15) are satisfied if

$c_m^{(1)} x_m^{(1)} + c_m^{(2)} x_m^{(2)} = 1,$
$c_m^{(1)} x_{m-1}^{(1)} + c_m^{(2)} x_{m-1}^{(2)} = 0.$

The solution of this system of linear equations is readily found to be

$c_m^{(1)} = \frac{x_{m-1}^{(2)}}{w_m}, \qquad c_m^{(2)} = -\frac{x_{m-1}^{(1)}}{w_m}.$

We thus find

(6-16)    $d_{n,m} = \frac{x_n^{(1)} x_{m-1}^{(2)} - x_n^{(2)} x_{m-1}^{(1)}}{w_m}.$

Substituting this into (6-13), we obtain (6-12).

Although theorem 6.6 does provide a solution to the nonhomogeneous 
equation that works in all cases, there is of course no guarantee that the 
sum appearing in (6-13) can be expressed in any simple form. (Neither 
can the integrals appearing in the variation-of-constants formula for 
differential equations always be expressed in closed form.) But formula 
(6-13) can be effective in cases where heuristic methods fail. 


EXAMPLE 
11. To find a particular solution of

(6-17)    $x_n - 2x_{n-1} + x_{n-2} = 1.$

By example 2, two linearly independent solutions of the homogeneous
equation are given by

$x_n^{(1)} = 1, \qquad x_n^{(2)} = n.$

Their Wronskian determinants are, by example 5, $w_n = -1$. Formula
(6-13) thus yields in view of $b_m = 1$ ($m = 0, 1, 2, \ldots$)

$x_n = -\sum_{m=0}^{n} \begin{vmatrix} 1 & n \\ 1 & m - 1 \end{vmatrix} = \sum_{m=0}^{n} (n + 1 - m).$

Changing the index of summation from $m$ to $p$, where $p = n + 1 - m$,
we have

$x_n = \sum_{p=1}^{n+1} p = 1 + 2 + 3 + \cdots + (n + 1).$

By a well-known summation formula we obtain

$x_n = \frac{(n + 1)(n + 2)}{2}.$


We leave it to the reader to verify that this indeed is a solution of (6-17). 
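The verification left to the reader can also be delegated to a short program that evaluates the variation-of-constants sum of theorem 6.6 directly. This is only a sketch: the function name is an assumption, the two homogeneous solutions are passed as functions returning integers, and exact fractions are used for the division by $w_m$.

```python
from fractions import Fraction


def variation_of_constants(x1, x2, b, n):
    """x_n = sum_{m=0}^{n} d_{n,m} b_m with d_{n,m} taken from (6-16)."""
    total = Fraction(0)
    for m in range(n + 1):
        w_m = x1(m) * x2(m - 1) - x2(m) * x1(m - 1)          # Wronskian at m
        d_nm = Fraction(x1(n) * x2(m - 1) - x2(n) * x1(m - 1), w_m)
        total += d_nm * b(m)
    return total


# Example 11: homogeneous solutions 1 and n, right-hand side b_m = 1.
one = lambda n: 1
ident = lambda n: n
rhs = lambda m: 1

x = [variation_of_constants(one, ident, rhs, n) for n in range(12)]
assert all(x[n] == Fraction((n + 1) * (n + 2), 2) for n in range(12))
assert all(x[n] - 2 * x[n - 1] + x[n - 2] == 1 for n in range(2, 12))
```

The same function applies to any pair of independent homogeneous solutions, although the resulting sum need not simplify to a closed form as it does here.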


Problems 


16. Let $a$, $b$ be any two constants, $a \neq 0$, $b \neq 0$. Find a particular solution
of the difference equation

$x_n - (a + b) x_{n-1} + ab \, x_{n-2} = a^n.$

(Distinguish the cases $a = b$ and $a \neq b$.)
17*. Show that formula (6-13) is valid for linear difference equations with
variable coefficients.


6.7 The Linear Difference Equation of Order N 


The theory developed in the sections §6.3 through §6.6 for difference
equations of order two carries over, without essential change, to equations
of arbitrary order. We now consider the difference operator $\mathcal{L}$ of order
$N$ defined by

(6-18)    $(\mathcal{L}X)_n = x_n + a_1 x_{n-1} + \cdots + a_N x_{n-N},$

where $a_1, a_2, \ldots, a_N$ are arbitrary (real or complex) constants, $a_N \neq 0$.
We are interested in both the homogeneous equation

(6-19)    $\mathcal{L}X = 0,$

where 0 denotes the null sequence, and the nonhomogeneous equation

(6-20)    $\mathcal{L}X = B,$

where $B$ denotes an arbitrary given sequence.
The following analogs of the results given above hold and are proved
similarly.


Lemma 6.3a' Any linear combination of two solutions of $\mathcal{L}X = 0$
is again a solution.


Lemma 6.3b' Let $I$ be a set of consecutive integers, let the integers
$m, m - 1, \ldots, m - N + 1$ be in $I$, and let the sequence $B$ be defined
on $I$. Then the difference equation $\mathcal{L}X = B$ has precisely one
solution which assumes given values for $n = m, m - 1, \ldots, m - N + 1$.


Again we have the corollary that if a solution of $\mathcal{L}X = 0$ vanishes at $N$
consecutive integers, it vanishes identically.

Let now the sequences $X^{(1)} = \{x_n^{(1)}\}$, $X^{(2)} = \{x_n^{(2)}\}, \ldots, X^{(N)} = \{x_n^{(N)}\}$ be
$N$ solutions of $\mathcal{L}X = 0$. Their Wronskian determinant at the point $n$ is
defined by

$w_n = \begin{vmatrix} x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(N)} \\ x_{n-1}^{(1)} & x_{n-1}^{(2)} & \cdots & x_{n-1}^{(N)} \\ \vdots & \vdots & & \vdots \\ x_{n-N+1}^{(1)} & x_{n-N+1}^{(2)} & \cdots & x_{n-N+1}^{(N)} \end{vmatrix}.$

The following result is obtained as in §6.3:


Theorem 6.3' Let $X^{(1)}, \ldots, X^{(N)}$ be $N$ solutions of $\mathcal{L}X = 0$. Then
every solution of $\mathcal{L}X = 0$ can be expressed as a linear combination of
$X^{(1)}, \ldots, X^{(N)}$ if and only if the Wronskian determinant of these
solutions is different from zero for all values of $n$.


$N$ sequences $A^{(1)}, A^{(2)}, \ldots, A^{(N)}$ are called linearly dependent if there
exist constants $c_1, c_2, \ldots, c_N$, not all zero, such that

$c_1 A^{(1)} + c_2 A^{(2)} + \cdots + c_N A^{(N)}$

is the zero sequence. If no such constants exist, the sequences are called
linearly independent. As in §6.4 we can show:


Theorem 6.4' Let $X^{(1)}, \ldots, X^{(N)}$ be $N$ solutions of $\mathcal{L}X = 0$, and let
$W = \{w_n\}$ be the sequence of their Wronskian determinants.

(i) If the solutions $X^{(1)}, \ldots, X^{(N)}$ are linearly dependent, then $W$ is
the zero sequence.

(ii) If $W$ contains a zero element, then the solutions $X^{(1)}, \ldots, X^{(N)}$
are linearly dependent.


It follows that if $w_m \neq 0$ for some integer $m$, then $w_n \neq 0$ for all $n$.
Thus the condition of theorem 6.3' is satisfied if the solutions $X^{(1)}, \ldots, X^{(N)}$
are linearly independent, and it follows that every solution of $\mathcal{L}X = 0$ can
be expressed as a linear combination of a system of linearly independent
solutions.

We now turn to the nonhomogeneous equation $\mathcal{L}X = B$. As above,
we have


Theorem 6.5' Let $Y$ be a special solution of $\mathcal{L}Y = B$, and let
$X^{(1)}, \ldots, X^{(N)}$ be a system of linearly independent solutions of the
corresponding homogeneous equation $\mathcal{L}X = 0$. Then every solution
$X$ of $\mathcal{L}X = B$ can be expressed in the form

$X = c_1 X^{(1)} + \cdots + c_N X^{(N)} + Y,$

where $c_1, \ldots, c_N$ are suitable constants.


Particular solutions of the nonhomogeneous equation can frequently
be found by guessing the general form of the solution and determining the
parameters such that the equation is satisfied. A more generally applicable
method is given by


Theorem 6.6' Let $X^{(1)} = \{x_n^{(1)}\}, \ldots, X^{(N)} = \{x_n^{(N)}\}$ be $N$ linearly
independent solutions of the homogeneous equation $\mathcal{L}X = 0$, and
let $W = \{w_n\}$ be the sequence of their Wronskian determinants. If
$B = \{b_n\}$ is defined on the set of nonnegative integers, then a solution
$X$ of $\mathcal{L}X = B$ is given by

$x_n = \sum_{m=0}^{n} \frac{1}{w_m} \begin{vmatrix} x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(N)} \\ x_{m-1}^{(1)} & x_{m-1}^{(2)} & \cdots & x_{m-1}^{(N)} \\ \vdots & \vdots & & \vdots \\ x_{m-N+1}^{(1)} & x_{m-N+1}^{(2)} & \cdots & x_{m-N+1}^{(N)} \end{vmatrix} b_m.$


Problem 


18*. Let $W$ denote the sequence of the Wronskian determinants of $N$ solutions
of $\mathcal{L}X = 0$. Find a linear difference equation of order 1 satisfied by $W$.


6.8 A System of Linearly Independent Solutions 


To complete the discussion of the linear difference equation of order $N$
with constant coefficients, we need to determine a system of $N$ linearly
independent solutions of the homogeneous equation $\mathcal{L}X = 0$. Special
solutions of $\mathcal{L}X = 0$ can again be found by means of the characteristic
polynomial

$p(z) = z^N + a_1 z^{N-1} + \cdots + a_N$

associated with the operator $\mathcal{L}$. If $z$ is a zero of the characteristic
polynomial, it is easy to see that the sequence $X = \{x_n\}$ defined by $x_n = z^n$ is a
solution of $\mathcal{L}X = 0$, for we have

$(\mathcal{L}X)_n = x_n + a_1 x_{n-1} + \cdots + a_N x_{n-N}$
$\phantom{(\mathcal{L}X)_n} = z^n + a_1 z^{n-1} + \cdots + a_N z^{n-N}$
$\phantom{(\mathcal{L}X)_n} = z^{n-N} p(z) = 0.$

We now distinguish two cases. 

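(i) The polynomial $p$ has $N$ distinct zeros $z_1, z_2, \ldots, z_N$. Then the
sequences $X^{(k)} = \{z_k^n\}$ ($k = 1, 2, \ldots, N$) represent $N$ solutions of $\mathcal{L}X = 0$.
Their Wronskian at $n = N - 1$ is

$\begin{vmatrix} z_1^{N-1} & z_2^{N-1} & \cdots & z_N^{N-1} \\ z_1^{N-2} & z_2^{N-2} & \cdots & z_N^{N-2} \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{vmatrix}.$

This determinant is known to have the value

$\prod_{m < n} (z_m - z_n) \neq 0.$

It follows that the solutions found are linearly independent.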

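(ii) General case: The polynomial $p$ has zeros of multiplicity $> 1$.
The above method does not furnish sufficiently many solutions. However,
further solutions can be found by differentiation. Let $z$ be a zero of
multiplicity $k + 1$ where $k > 0$. We then have

(6-22)    $p(z) = p'(z) = \cdots = p^{(k)}(z) = 0.$

It was shown above that the sequence $X = \{z^n\}$ is a solution. We now
assert that also the sequences $X^{(1)}, X^{(2)}, \ldots, X^{(k)}$ defined by

$x_n^{(1)} = n z^{n-1},$
$x_n^{(2)} = n(n-1) z^{n-2},$
$\cdots$
$x_n^{(k)} = n(n-1) \cdots (n-k+1) z^{n-k},$

are solutions. Indeed, if $0 < m \le k$, then

$(\mathcal{L}X^{(m)})_n = x_n^{(m)} + a_1 x_{n-1}^{(m)} + \cdots + a_N x_{n-N}^{(m)}$
$\phantom{(\mathcal{L}X^{(m)})_n} = n(n-1) \cdots (n-m+1) z^{n-m}$
$\phantom{(\mathcal{L}X^{(m)})_n =} + a_1 (n-1)(n-2) \cdots (n-m) z^{n-m-1}$
$\phantom{(\mathcal{L}X^{(m)})_n =} + \cdots$
$\phantom{(\mathcal{L}X^{(m)})_n =} + a_N (n-N)(n-N-1) \cdots (n-N-m+1) z^{n-N-m}$
$\phantom{(\mathcal{L}X^{(m)})_n} = \left(z^{n-N} p(z)\right)^{(m)}.$

By the Leibniz rule for differentiating a product (see Kaplan [1953],
p. 19),

$\left(z^{n-N} p(z)\right)^{(m)} = \binom{m}{0} (n-N)(n-N-1) \cdots (n-N-m+1) z^{n-N-m} p(z)$
$\phantom{\left(z^{n-N} p(z)\right)^{(m)} =} + \binom{m}{1} (n-N) \cdots (n-N-m+2) z^{n-N-m+1} p'(z)$
$\phantom{\left(z^{n-N} p(z)\right)^{(m)} =} + \cdots$
$\phantom{\left(z^{n-N} p(z)\right)^{(m)} =} + \binom{m}{m} z^{n-N} p^{(m)}(z),$

which vanishes for all $n$, by (6-22).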
We are thus led to the following analog of theorem 6.2: 


Theorem 6.8 Let the characteristic polynomial of the Nth order
equation $\mathcal{L}X = 0$ have the distinct zeros $z_1, z_2, \ldots, z_k$ ($k \le N$), and
let the multiplicity of $z_i$ be $m_i + 1$ ($m_1 + m_2 + \cdots + m_k = N - k$). Then the
sequences defined by

(6-23)    $x_n = n(n-1) \cdots (n-m+1) z_i^{n-m}, \qquad m = 0, 1, \ldots, m_i; \quad i = 1, 2, \ldots, k,$

form a system of $N$ linearly independent solutions of $\mathcal{L}X = 0$.


We omit the proof of the fact that the $N$ solutions given by (6-23) are
linearly independent.


EXAMPLES 
12. Let

$(\mathcal{L}X)_n = x_n - 2x_{n-2} + x_{n-4}.$

The characteristic polynomial

$p(z) = z^4 - 2z^2 + 1 = (z + 1)^2 (z - 1)^2$

has the two distinct zeros $z_1 = 1$, $z_2 = -1$, each with multiplicity 2.
Theorem 6.8 thus yields the four solutions

$x_n^{(1)} = 1, \qquad x_n^{(2)} = n, \qquad x_n^{(3)} = (-1)^n, \qquad x_n^{(4)} = n(-1)^{n-1}.$

13. We shall show: If $z$ is a zero of multiplicity 4 of the characteristic
polynomial, then the sequence $X = \{n^3 z^n\}$ is a solution. Proof: We seek
to represent $X$ as a linear combination of the solutions given by theorem
6.8. We have

$n(n-1)(n-2) = n^3 - 3n^2 + 2n,$
$n(n-1) = n^2 - n,$
$n = n,$

and hence

$n^3 = n(n-1)(n-2) + 3n(n-1) + n.$

Thus

$n^3 z^n = z^3 [n(n-1)(n-2) z^{n-3}] + 3z^2 [n(n-1) z^{n-2}] + z [n z^{n-1}],$

which is a combination of the desired form.
Example 13 leads us to conjecture the following fact: 


Corollary 6.8 Under the hypotheses of theorem 6.8 a set of $N$
independent solutions of $\mathcal{L}X = 0$ is also given by the sequences

(6-24)    $x_n = n^m z_i^n, \qquad m = 0, 1, \ldots, m_i; \quad i = 1, 2, \ldots, k.$


The proof that the sequences $X$ defined by (6-24) are solutions boils
down to showing that $n^m$ can be written as a linear combination of

$n, \; n(n-1), \; \ldots, \; n(n-1) \cdots (n-m+1),$

with coefficients that are independent of $n$. This is readily verified by an
induction argument. The proof that the $N$ sequences given by (6-24) are
linearly independent is again omitted.
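The solutions of the form $n^m z^n$ attached to multiple zeros can be checked by substitution. The sketch below does this for the operator of example 12 and for a hypothetical operator with the quadruple zero $z = 2$, as in example 13; the helper name `apply_L` and the choice $z = 2$ are assumptions made for the illustration.

```python
def apply_L(coeffs, x, n):
    """(LX)_n = x_n + a_1 x_{n-1} + ... + a_N x_{n-N}, with coeffs = [a_1, ..., a_N]."""
    return x(n) + sum(a * x(n - k) for k, a in enumerate(coeffs, start=1))


# Example 12: p(z) = z^4 - 2z^2 + 1 = (z + 1)^2 (z - 1)^2.
coeffs_12 = [0, -2, 0, 1]
candidates = [
    lambda n: 1,                 # n^0 * 1^n
    lambda n: n,                 # n^1 * 1^n
    lambda n: (-1) ** n,         # n^0 * (-1)^n
    lambda n: n * (-1) ** n,     # n^1 * (-1)^n
]
for x in candidates:
    assert all(apply_L(coeffs_12, x, n) == 0 for n in range(4, 14))

# Quadruple zero z = 2: p(z) = (z - 2)^4 = z^4 - 8z^3 + 24z^2 - 32z + 16.
coeffs_quad = [-8, 24, -32, 16]
cubic_times_power = lambda n: n ** 3 * 2 ** n
assert all(apply_L(coeffs_quad, cubic_times_power, n) == 0 for n in range(4, 14))
```

Only indices $n \ge 4$ are tested so that every power of 2 involved stays an integer and the comparison with zero is exact.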


EXAMPLE 
14. To find the general solution of the difference equation

$x_n - \binom{N}{1} x_{n-1} + \binom{N}{2} x_{n-2} - \cdots + (-1)^N \binom{N}{N} x_{n-N} = 0.$

The characteristic polynomial is

$p(z) = z^N - \binom{N}{1} z^{N-1} + \binom{N}{2} z^{N-2} - \cdots + (-1)^N.$

As is known from the binomial theorem,

$p(z) = (z - 1)^N.$

It follows that $z = 1$ is a zero of multiplicity $N$. Corollary 6.8 yields the
solutions $1, n, n^2, \ldots, n^{N-1}$, and the general solution can be written in the
form

$x_n = c_0 + c_1 n + c_2 n^2 + \cdots + c_{N-1} n^{N-1}.$


Problems 


19. Let N > 0 be an integer. Find a general solution of the difference
equation
$x_n + x_{n-1} + x_{n-2} + \cdots + x_{n-N} = 0$.
Deduce that for any real angle $\alpha$ and any integer m, $1 \le m \le N$, if
$\varphi = 2\pi m/(N+1)$,

$\cos\alpha + \cos(\alpha + \varphi) + \cdots + \cos(\alpha + N\varphi) = 0$,
$\sin\alpha + \sin(\alpha + \varphi) + \cdots + \sin(\alpha + N\varphi) = 0$.
20. Construct a difference equation that has the solutions 
(a) $x_n = 1$, $x_n = 2^n$, $x_n = 3^n$;
(b) , a F an; XS Hh: 
Also find a difference equation that has both the solutions given under 
(a) and (b). 
21*. Let $0 \le \varphi_k < 2\pi$, k = 1, 2, ..., N, $\varphi_k \ne \varphi_j$ for $k \ne j$, and let the numbers
$\alpha_k$ be arbitrary. Show that the N sequences


$x_n^{(k)} = e^{i(\alpha_k + n\varphi_k)}$, k = 1, 2, ..., N


are linearly independent. 


6.9 The Backward Difference Operator 


An important special difference operator of the kind defined by (6-18)
is the backward difference operator, traditionally denoted by $\nabla$ (read
"nabla", from the Arabic word for harp) and defined by

(6-25) $(\nabla X)_n = x_n - x_{n-1}$.

This difference operator is of order 1. It is closely related to the forward
difference operator $\Delta$ which was introduced in §4.4. In fact, we have

$(\Delta X)_n = (\nabla X)_{n+1}$.


Thus every relation valid for the forward difference operator can also be 
expressed in terms of the backward operator, and conversely. In the 
framework of our present notations it is a little more convenient to work 
with the backward operator. 

Integral powers of the operator $\nabla$ are defined inductively by the relation

(6-26) $\nabla^k X = \nabla(\nabla^{k-1} X)$, k = 2, 3, ....

EXAMPLE
15. $\nabla^2 X = \nabla(\nabla X)$, hence

$(\nabla^2 X)_n = (\nabla X)_n - (\nabla X)_{n-1}$
$= x_n - x_{n-1} - (x_{n-1} - x_{n-2})$
$= x_n - 2x_{n-1} + x_{n-2}$.

The example suggests an explicit expression for $(\nabla^k X)_n$ in terms of
binomial coefficients. In fact,

(6-27) $(\nabla^k X)_n = x_n - \binom{k}{1} x_{n-1} + \binom{k}{2} x_{n-2} - \cdots + (-1)^k \binom{k}{k} x_{n-k}$.

The proof of (6-27) is by induction with respect to k. By definition, the
formula is true for k = 1. Assuming its truth for k = m, we have

$(\nabla^{m+1} X)_n = (\nabla^m X)_n - (\nabla^m X)_{n-1}$

$= \left[x_n - \binom{m}{1} x_{n-1} + \binom{m}{2} x_{n-2} - \cdots + (-1)^m x_{n-m}\right]$
$\quad - \left[x_{n-1} - \binom{m}{1} x_{n-2} + \cdots + (-1)^{m-1}\binom{m}{m-1} x_{n-m} + (-1)^m x_{n-m-1}\right]$.

By virtue of the identity (3-13) this expression simplifies to

$x_n - \binom{m+1}{1} x_{n-1} + \binom{m+1}{2} x_{n-2} - \cdots + (-1)^{m+1} x_{n-m-1}$,

which is equal to the term on the right of (6-27) when k is replaced by
m + 1. Thus, (6-27) must be true for all positive integers k.
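
Formula (6-27) can also be confirmed numerically by comparing it with repeated application of the definition (6-26). The sketch below is an illustration only; the function names and the sample sequence are assumptions, not notation from the text.

```python
from math import comb

def nabla(seq):
    """One backward difference: entry j of the result is seq[j+1] - seq[j]."""
    return [b - a for a, b in zip(seq, seq[1:])]

def nabla_power_repeated(seq, k):
    """Apply nabla k times; the last entry is (nabla^k X)_n at the last index n."""
    for _ in range(k):
        seq = nabla(seq)
    return seq[-1]

def nabla_power_binomial(seq, k):
    """Right-hand side of (6-27), evaluated at the last index n of seq."""
    return sum((-1) ** j * comb(k, j) * seq[-1 - j] for j in range(k + 1))

x = [3, 1, 4, 1, 5, 9, 2, 6]          # arbitrary sample sequence
results = [(nabla_power_repeated(x, k), nabla_power_binomial(x, k))
           for k in range(1, 6)]
```

Both evaluations agree for every k tried, which is what the induction proof asserts in general.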

In practice, the operator $\nabla$ is mainly used in connection with sequences
F defined by

$f_n = f(x_n)$,

where $x_n = nh$, h being a positive constant. One ordinarily writes $\nabla f_n$
in place of the logically more consistent, but cumbersome notation
$(\nabla F)_n$. The values of successive powers of $\nabla$ are conveniently arranged in
a two-dimensional array, as in table 6.9a.


Table 6.9a
f_n
           ∇f_{n+1}
f_{n+1}                ∇²f_{n+2}
           ∇f_{n+2}                 ∇³f_{n+3}
f_{n+2}                ∇²f_{n+3}
           ∇f_{n+3}
f_{n+3}


Table 6.9a is called the difference table of the function f, constructed
with the step h. Each entry in the table is the difference of the two entries
immediately to the left of it.

EXAMPLE

16. From Comrie’s table of the exponential function $f(x) = e^x$ (Comrie
[1961]), which has the step $h = 10^{-3}$, we can construct the following
difference table (beginning at $x_n = 1.35$):


Table 6.9b
x_n      f_n         ∇f_n        ∇²f_n

1.350    3.857426
                     0.003859
1.351    3.861285                0.000004
                     0.003863
1.352    3.865148                0.000004
                     0.003867
1.353    3.869015
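
A difference table of this kind is easy to generate mechanically. The following sketch is an illustration under stated assumptions: the function name `difference_table` is hypothetical, and the six-decimal values of $e^x$ are produced by rounding `math.exp` rather than by reading Comrie's table. Values are stored in units of $10^{-6}$ so that all differences are exact integers.

```python
import math

def difference_table(values, depth):
    """Return [values, first differences, second differences, ...] up to depth."""
    columns = [list(values)]
    for _ in range(depth):
        prev = columns[-1]
        columns.append([b - a for a, b in zip(prev, prev[1:])])
    return columns

# e^x to six decimals with step h = 0.001, in units of 1e-6.
xs = [1.350, 1.351, 1.352, 1.353]
f = [round(math.exp(x) * 10**6) for x in xs]
exp_table = difference_table(f, 2)

# A cubic polynomial tabulated with step h = 1: third differences are
# constant and fourth differences vanish.
poly_table = difference_table([x**3 - 2 * x + 1 for x in range(8)], 4)
```

The first and second difference columns of `exp_table` correspond to the entries 0.003859, 0.003863, 0.003867 and 0.000004 of table 6.9b.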


The study of properties and applications of the difference table, in
particular the relation between differences and derivatives of a function,
will be one of our chief concerns in part II. Here we are content with
stating the following fundamental fact.


Theorem 6.9 Let the function f be defined on the whole real line,
let k be a positive integer, and let h > 0. If $f_n = f(nh)$, a necessary
and sufficient condition in order that
(6-28) $\nabla^k f_n = 0$
for all integers n is that f(nh) = P(nh) for all n, where P is a polynomial
of degree not exceeding $k - 1$.

Briefly, the theorem states that the kth differences of a function are
identically zero if and only if, at the points where the differences are taken,
the function agrees with a polynomial of degree < k.


Proof. We have to show: (a) If the kth differences are zero, then the
values of f at x = nh are those of a polynomial of degree < k; (b) If the
values of f agree with those of a polynomial P of degree < k at all points
x = nh, then the kth differences are zero.

To prove (a), we note that (6-28) means that the sequence $F = \{f_n\}$ is a
solution of the difference equation
(6-29) $\nabla^k F = 0$.

By (6-27), the characteristic polynomial of this equation is

$p(z) = z^k - \binom{k}{1} z^{k-1} + \binom{k}{2} z^{k-2} - \cdots + (-1)^k$
$= (z-1)^k$.

This polynomial has a single zero, of multiplicity k, at z = 1. By
corollary 6.8, any solution of (6-29) can thus be represented in the form
$f_n = c_0 + c_1 n + c_2 n^2 + \cdots + c_{k-1} n^{k-1}$.

Clearly, $f_n = P(nh)$, where P is the polynomial defined by

$P(x) = c_0 + \frac{c_1}{h} x + \frac{c_2}{h^2} x^2 + \cdots + \frac{c_{k-1}}{h^{k-1}} x^{k-1}$.

To prove (b), let f(x) = P(x) for x = nh, where P is a polynomial of
degree < k,

$P(x) = a_0 x^{k-1} + a_1 x^{k-2} + \cdots + a_{k-1}$.
We then have
$f_n = P(nh) = a_0 (nh)^{k-1} + a_1 (nh)^{k-2} + \cdots + a_{k-1}$.

This relation shows that the sequence $F = \{f_n\}$ is a linear combination of
the sequences $X^{(m)}$ defined by

$x_n^{(m)} = n^m$, m = 0, 1, 2, ..., k − 1.

These sequences are solutions of the difference equation $\nabla^k X = 0$, by
corollary 6.8. It follows that the sequence F is a solution of the same
difference equation, that is, (6-28) holds.


EXAMPLE 
17. Let $P(x) = x^3 - 2x + 1$. The difference table with step h = 1
(begun at x = 0) is shown in table 6.9c.


Table 6.9c
P(x_n)    ∇       ∇²      ∇³      ∇⁴

  1
          -1
  0                6
           5               6
  5               12               0
          17               6
 22               18               0
          35               6
 57               24               0
          59               6
116               30               0
          89               6
205               36
         125
330


Problems 


22. If T = T(t) denotes the sequence of Chebyshev polynomials introduced in
example 3, calculate

(a) $(\nabla^2 T(t))_{n+1}$, (b) $(\nabla^4 T(t))_{n+2}$.

23. Give an explicit formula for the differences of the exponential function,
$f(x) = e^x$, and show that these differences are never zero. Explain the
apparent contradiction with the numerical values given in table 6.9b.

24. The first differences given in table 6.9b are very nearly equal to $10^{-3}$ times
the average of the adjacent function values. Is this an accident?

25. Formula (6-27) expresses $(\nabla^k X)_n$ in terms of $x_n, x_{n-1}, \ldots, x_{n-k}$. Show
conversely how to express $x_{n-k}$ in terms of $(\nabla^0 X)_n, (\nabla X)_n, \ldots, (\nabla^k X)_n$.


Recommended Reading 


A more thorough treatment of difference equations will be found in 
Milne-Thomson [1933]. Goldberg [1958] gives an enjoyable elementary 
account with many interesting applications. 


Research Problem 


Try to generalize as many results of this chapter as possible to linear 
difference equations with variable coefficients. Do not attempt to find 
solutions in explicit form. 


chapter 7 Bernoulli’s method 


With the present chapter we return to the problem of solving nonlinear, 
and in particular polynomial, equations. The methods discussed in the 
chapters devoted to iteration are very effective, but only if a reasonable 
first approximation to the desired solution is known. How to obtain such 
a first approximation is a problem which, for equations without special 
properties, is of such generality that it cannot be solved by generally 
applicable rules or algorithms. For polynomials, however, there do exist 
algorithms that furnish the desired first approximation using no other 
information than the coefficients of the polynomial. Two such algorithms 
—a classical one due to D. Bernoulli and one of its modern extensions due 
to Rutishauser—form the subject of this and the next chapter. Bernoulli’s 
method, in particular, is one which yields all dominant zeros of a poly- 
nomial. By a dominant zero we mean a zero whose modulus is not 
exceeded by the modulus of any other zero. 


7.1 Single Dominant Zero 


In chapter 6 we have seen how the general linear difference equation 
with constant coefficients can be solved analytically by determining the 
zeros of the associated characteristic polynomial. Bernoulli’s method 
consists in reversing this procedure. The polynomial whose zeros are 
sought is considered the characteristic polynomial of some difference 
equation, and this associated difference equation is solved numerically 
by solving the recurrence relation implied by it. From this solution it is 
easy to extract information about the zeros of the polynomial, as we 
shall see. 

To begin with the simplest case, let us assume that the polynomial of
degree N,

(7-1) $p(z) = a_0 z^N + a_1 z^{N-1} + \cdots + a_N$,

whose coefficients may be complex, has N distinct zeros $z_1, z_2, \ldots, z_N$.
What happens if we solve the difference equation

(7-2) $a_0 x_n + a_1 x_{n-1} + \cdots + a_N x_{n-N} = 0$

which has (7-1) as its characteristic polynomial? According to §6.7, the
solution $X = \{x_n\}$ (whatever its starting values are) must be representable
in the form

(7-3) $x_n = c_1 z_1^n + c_2 z_2^n + \cdots + c_N z_N^n$,

where $c_1, c_2, \ldots, c_N$ are suitable constants. To proceed further we make
two assumptions:

(i) The polynomial p has a single dominant zero, i.e., one of the zeros
(we may call it $z_1$) has a larger modulus than all the others:

(7-4) $|z_1| > |z_k|$, k = 2, 3, ..., N.

(ii) The starting values are such that the dominant zero is represented
in the solution (7-3), i.e., we have

(7-5) $c_1 \ne 0$.

We now consider the ratio of two consecutive values of the solution
sequence X. Using (7-3) we find

$\frac{x_{n+1}}{x_n} = \frac{c_1 z_1^{n+1} + c_2 z_2^{n+1} + \cdots + c_N z_N^{n+1}}{c_1 z_1^n + c_2 z_2^n + \cdots + c_N z_N^n}$.

By virtue of (7-5) this may be written

(7-6) $\frac{x_{n+1}}{x_n} = z_1 \, \frac{1 + \frac{c_2}{c_1}\left(\frac{z_2}{z_1}\right)^{n+1} + \cdots + \frac{c_N}{c_1}\left(\frac{z_N}{z_1}\right)^{n+1}}{1 + \frac{c_2}{c_1}\left(\frac{z_2}{z_1}\right)^{n} + \cdots + \frac{c_N}{c_1}\left(\frac{z_N}{z_1}\right)^{n}}$.

By (7-4), $|z_k/z_1| < 1$ for k = 2, 3, ..., N. It follows that

(7-7) $\left(\frac{z_k}{z_1}\right)^n \to 0$ as $n \to \infty$

for k = 2, 3, ..., N. The fraction multiplying $z_1$ thus tends to 1 as
$n \to \infty$, and we find

$\lim_{n \to \infty} \frac{x_{n+1}}{x_n} = z_1$.


We thus have the following tentative formulation of Bernoulli’s method:


Algorithm 7.1 Choose arbitrary values $x_0, x_{-1}, \ldots, x_{-N+1}$, and
determine the sequence $\{x_n\}$ from the recurrence relation

$x_n = -\frac{a_1 x_{n-1} + a_2 x_{n-2} + \cdots + a_N x_{n-N}}{a_0}$.

Then form the sequence of quotients

(7-8) $q_n = \frac{x_{n+1}}{x_n}$.


We have proved:


Theorem 7.1 If the polynomial p given by (7-1) has a single dominant
zero, and if the starting values are such that (7-5) holds, then the
quotients $q_n$ are ultimately defined and converge to the dominant
zero of p.
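
Algorithm 7.1 translates almost directly into a short program. The following is a minimal sketch under stated assumptions: the function name `bernoulli_quotients` is hypothetical, the starting values are taken as $x_0 = 1$ and $x_n = 0$ for n < 0, and no guard is included against a vanishing $x_n$ (which would make a quotient undefined).

```python
def bernoulli_quotients(a, steps):
    """Quotients q_n = x_{n+1}/x_n for p(z) = a[0] z^N + ... + a[N].

    Starting values (an assumption of this sketch): x_0 = 1, x_n = 0 for n < 0.
    """
    N = len(a) - 1
    x = [0.0] * (N - 1) + [1.0]            # x_{-N+1}, ..., x_{-1}, x_0
    for _ in range(steps):
        # Recurrence of algorithm 7.1; x[-j] is x_{n-j}.
        x.append(-sum(a[j] * x[-j] for j in range(1, N + 1)) / a[0])
    xs = x[N - 1:]                          # x_0, x_1, ..., x_steps
    return [xs[n + 1] / xs[n] for n in range(len(xs) - 1)]

# z^2 - z - 1: the x_n are Fibonacci numbers, q_n tends to (1 + sqrt 5)/2.
q_fib = bernoulli_quotients([1, -1, -1], 60)

# 70 z^4 - 140 z^3 + 90 z^2 - 20 z + 1: q_n tends slowly to about 0.93057.
q_quartic = bernoulli_quotients([70, -140, 90, -20, 1], 200)
```

The second call needs many more steps than the first because the ratio of the two largest zeros is much closer to 1.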


EXAMPLES 
1. Applying Bernoulli’s method to the polynomial
$p(z) = z^2 - z - 1$

with the starting values $x_0 = 1$, $x_{-1} = 0$ yields the Fibonacci sequence
considered in example 7 of chapter 6. Conditions (i) and (ii) are clearly
satisfied here. The ratios of consecutive elements thus converge to the
dominant zero $z_1 = (1 + \sqrt{5})/2$.
2. Let

$p(z) = 70z^4 - 140z^3 + 90z^2 - 20z + 1$.

The difference equation (7-2) (solved for $x_n$) here takes the form

$x_n = \frac{140 x_{n-1} - 90 x_{n-2} + 20 x_{n-3} - x_{n-4}}{70}$.

The first three columns of table 7.1 give the values of n, of the
sequence $x_n$ (started with $x_0 = 1$, $x_n = 0$, n < 0) and of the sequence
$\{q_n\}$. (The other columns will be explained in §7.2.)

The preceding is a mere outline of Bernoulli’s method in the simplest 
possible case. A number of complications have yet to be dealt with, such 
as the following: 


1. Slow convergence. 

2. Zeros of multiplicity > 1. 

3. Unfortunate choice of initial values. 
4. Several dominant zeros. 

5. Calculation of nondominant zeros. 


These questions will be dealt with in the subsequent sections. 


Problems 


1. Find, by Bernoulli’s method, the dominant zero of the polynomial
$p(z) = 32z^3 - 48z^2 + 18z - 1$
to three significant digits.

Table 7.1
A 2 
" ity ao Aq Aan -oe qh 
0 1 
1 2 
2 2.7142857 , 
4 33530611 1-9668831 99499395 0.0417791  —0.1214233 0.9454595 
5 34122447 10176506 pn 999796 00199619 —0.0792746 0.9383760 
6 3.3725945 9:9883800  __p g194639 9.0108076 + —0.0536856 0.9346444 
8 3.1331066 9-9578036  __g oogigs3 0.003948! 
0.9496383 0.0025560 
9 2,9753180 — 0.0056093 
10 28087866 99440290 _ 9 9939952 __(0:0017041 
0.9401238 0.001 1603 
11 2.6406071 —0.0027449 
12 2.4752493 9-9373789__g 9939425 —9-0008024 
13 2,3154382 9:9354364 _9 9913814 0.000561! 
15 20179923  9:9330692 _ p og97052 _9.0002806 +—0.0024890 0.9305802 
16 1.8815034 Peri —0,0005054 9-0001998 
17 1.7532951 9.93185 
0.93057 0.93057 
2. Use the value obtained in problem 1 to obtain the dominant zero to six
significant digits, using either Newton’s or Steffensen’s method.

3. Look up the closed formula for finding the zeros of a cubic polynomial and
use it to determine the dominant zero of the polynomial of problem 1.

4. What is the meaning of the phrase “the quotients are ultimately defined”
in theorem 7.1? Give an example of a polynomial violating condition (i)
where infinitely many of the quotients $q_n$ are undefined.


7.2 Accelerating Convergence 


Even if the conditions of convergence of Bernoulli’s method are satisfied,
the speed of convergence may be slow. By this we mean that the error of
the approximation $x_{n+1}/x_n$ to the zero $z_1$,

$d_n = \frac{x_{n+1}}{x_n} - z_1$,

tends to zero only slowly. As in the case of iteration, it may be possible to
speed up convergence by making judicious use of information about the
manner in which $d_n$ approaches zero. In order to discover this manner
of convergence, we shall analyze the errors $d_n$ more closely. We shall
continue to make the assumptions (i) and (ii) of §7.1. In addition, we
assume that

(7-9) $|z_1| > |z_2| > |z_k|$, k = 3, 4, ..., N

(the next-to-dominant zero is the only zero of its modulus), and that

(7-10) $c_2 \ne 0$

in (7-3) (the next-to-dominant zero is represented in the solution $\{x_n\}$).
Under these hypotheses, the error

$d_n = \frac{c_1 z_1^{n+1} + c_2 z_2^{n+1} + \cdots + c_N z_N^{n+1} - z_1 (c_1 z_1^n + \cdots + c_N z_N^n)}{c_1 z_1^n + c_2 z_2^n + \cdots + c_N z_N^n}$

$= \frac{c_2 (z_2 - z_1) z_2^n + \cdots + c_N (z_N - z_1) z_N^n}{c_1 z_1^n + c_2 z_2^n + \cdots + c_N z_N^n}$

can be written in the form

(7-11) $d_n = A t^n (1 + \varepsilon_n)$,
where
$A = \frac{c_2 (z_2 - z_1)}{c_1}$, $t = \frac{z_2}{z_1}$,
and

$1 + \varepsilon_n = \frac{1 + \frac{c_3 (z_3 - z_1)}{c_2 (z_2 - z_1)} \left(\frac{z_3}{z_2}\right)^n + \cdots + \frac{c_N (z_N - z_1)}{c_2 (z_2 - z_1)} \left(\frac{z_N}{z_2}\right)^n}{1 + \frac{c_2}{c_1} \left(\frac{z_2}{z_1}\right)^n + \cdots + \frac{c_N}{c_1} \left(\frac{z_N}{z_1}\right)^n}$.

By virtue of condition (7-9), we have, in addition to (7-7),

$\left(\frac{z_k}{z_2}\right)^n \to 0$ as $n \to \infty$

for k = 3, 4, ..., N, and the ratio on the right has the limit 1 as $n \to \infty$.
It follows that

(7-12) $\lim_{n \to \infty} \varepsilon_n = 0$.

As a consequence,

$\frac{d_{n+1}}{d_n} = t \, \frac{1 + \varepsilon_{n+1}}{1 + \varepsilon_n} = t (1 + \delta_n)$,

where

$\delta_n = \frac{\varepsilon_{n+1} - \varepsilon_n}{1 + \varepsilon_n} \to 0$ as $n \to \infty$.

The errors $d_n$ thus satisfy precisely the condition (4-12) of theorem 4.5
which makes Aitken’s $\Delta^2$ process effective. We thus have as an immediate
consequence of theorem 4.5:


Theorem 7.2 Under the hypotheses stated at the beginning of the
present section, the sequence $\{q_n'\}$ derived from the sequence $\{q_n\}$ by
means of Aitken’s $\Delta^2$ formula,
$q_n' = q_n - \frac{(\Delta q_n)^2}{\Delta^2 q_n}$,
converges faster to the dominant zero $z_1$ than the sequence $\{q_n\}$ in the
sense that

$\lim_{n \to \infty} \frac{q_n' - z_1}{q_n - z_1} = 0$.

EXAMPLE
3. In table 7.1 the values $q_n'$ are shown in the last column. Also given
are the intermediate values required to calculate $q_n'$. The faster convergence
of the sequence $\{q_n'\}$ to the exact zero $z_1 = 0.93057$ is evident.
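
The acceleration of theorem 7.2 can be tried on the polynomial of example 2. The sketch below is illustrative only: the function names are hypothetical, the differences are taken as forward differences ($\Delta q_n = q_{n+1} - q_n$, which is an assumption about the convention of §4.4), and the reference value of $z_1$ is a double-precision approximation supplied for comparison.

```python
def bernoulli_quotients(a, steps):
    """q_n = x_{n+1}/x_n for p(z) = a[0] z^N + ... + a[N], x_0 = 1, x_{n<0} = 0."""
    N = len(a) - 1
    x = [0.0] * (N - 1) + [1.0]
    for _ in range(steps):
        x.append(-sum(a[j] * x[-j] for j in range(1, N + 1)) / a[0])
    xs = x[N - 1:]
    return [xs[n + 1] / xs[n] for n in range(len(xs) - 1)]

def aitken(q):
    """Aitken's formula q'_n = q_n - (dq_n)^2 / d2q_n with forward differences."""
    out = []
    for n in range(len(q) - 2):
        d1 = q[n + 1] - q[n]
        d2 = q[n + 2] - 2 * q[n + 1] + q[n]
        out.append(q[n] - d1 * d1 / d2)
    return out

a = [70, -140, 90, -20, 1]
q = bernoulli_quotients(a, 40)
q_acc = aitken(q)

z1 = 0.9305681557970263                    # reference value of the dominant zero
err_plain = abs(q[30] - z1)
err_accelerated = abs(q_acc[30] - z1)
```

At n = 30 the accelerated error is smaller than the plain error by several orders of magnitude, in line with the limit relation of theorem 7.2.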


Problems 


5. Find, by Bernoulli’s method speeded up by the $\Delta^2$ process, the dominant
zero of the polynomial

$p(z) = z^4 - 7z^3 + 13z^2 - 8z + 12$.

6. Apply the $\Delta^2$ method to the sequence $\{q_n\}$ obtained in problem 1.

7. Explain why the $\Delta^2$ method does not speed up appreciably the Bernoulli
sequence for the polynomial
$p(z) = z^3 - 4z^2 + 6z - 4$.
(z = 2 is the sole dominant zero.)

8. If suitable conditions are satisfied, the error of the sequence $\{q_n\}$ generated
by Bernoulli’s method tends to zero like $(z_2/z_1)^n$. How does the error
of the accelerated sequence $\{q_n'\}$ tend to zero? What do you conclude
about the number of steps necessary to achieve a given accuracy
(a) if $|z_1| \approx |z_2| > |z_3|$,
(b) if $|z_1| > |z_2| \approx |z_3|$?


7.3 Zeros of Higher Multiplicity 


If the polynomial p has repeated nondominant zeros $z_2, \ldots, z_N$, then
formula (7-3) for the general solution of the difference equation (7-2) also
contains (by corollary 6.8) terms like $n^k z_2^n$. Thus the expression (7-6)
contains terms like

$n^k \left(\frac{z_2}{z_1}\right)^n$ in addition to $\left(\frac{z_2}{z_1}\right)^n$.

However, these terms cannot disturb the convergence of the method, since
if |q| < 1, then we have not only $q^n \to 0$ as $n \to \infty$ but also $n^k q^n \to 0$ for
any fixed value of k.

The situation is different if the dominant zero has multiplicity > 1.
(We still assume that there is only one dominant zero.) To fix ideas, let
the multiplicity be 2. Relation (7-3) then takes the form

$x_n = c_1 n z_1^n + c_2 z_1^n + c_3 z_3^n + \cdots$,

where $|z_3| < |z_1|$. Thus the ratio (7-6) now becomes (still assuming
$c_1 \ne 0$)

(7-13) $\frac{x_{n+1}}{x_n} = \frac{c_1 (n+1) z_1^{n+1} + c_2 z_1^{n+1} + c_3 z_3^{n+1} + \cdots}{c_1 n z_1^n + c_2 z_1^n + c_3 z_3^n + \cdots}$

$= z_1 \, \frac{(n+1) c_1 + c_2}{n c_1 + c_2} \cdot \frac{1 + \frac{c_3}{(n+1) c_1 + c_2} \left(\frac{z_3}{z_1}\right)^{n+1} + \cdots}{1 + \frac{c_3}{n c_1 + c_2} \left(\frac{z_3}{z_1}\right)^{n} + \cdots}$.

Convergence still takes place, but, due to the factor

$\frac{(n+1) c_1 + c_2}{n c_1 + c_2} = 1 + \frac{1}{n + c_2/c_1}$,

at a much slower rate; the error after n steps is now of the order of 1/n as
against $(z_2/z_1)^n$ if the dominant zero has multiplicity one. An example for
this slowdown of convergence, as well as a possible remedy, is given in the
following section.


Problems 


9. Devise an Aitken-like acceleration scheme for the case of a single dominant
zero of multiplicity 2. [Sketch of a possible solution: From (7-13) we
find for the error $d_n = q_n - z_1$

$d_n = \frac{c_1 z_1 + O(t^n)}{n c_1 + c_2 + O(t^n)}$,

where $t = z_3/z_1$. Neglecting $O(t^n)$ and setting $c = c_2/c_1$, we have

(7-14) $q_n - z_1 = \frac{z_1}{n + c}$.

The unknowns c and, if desired, n can be eliminated from consecutive
relations (7-14) as in the derivation of Aitken’s formula, yielding a
formula for $z_1$.]

10. Apply the procedure devised in problem 9 to the calculation of the
dominant zero of the polynomial

$p(z) = z^4 - 4z^3 - 2z^2 + 12z + 9$.


7.4 Choice of Starting Values


One of the conditions for convergence of algorithm 7.1 was that $c_1 \ne 0$.
It can be shown by means of complex variable theory that this condition is
always satisfied if the starting values are chosen as follows:

(7-15) $x_{-N+1} = x_{-N+2} = \cdots = x_{-1} = 0$, $x_0 = 1$.

A different, more sophisticated choice of starting values is defined by
the following algorithm:


Algorithm 7.4 If the coefficients $a_0, a_1, \ldots, a_N$ are given, calculate
$x_0, x_1, \ldots, x_{N-1}$ by the formulas

$x_0 = -\frac{a_1}{a_0}$,

$x_1 = -\frac{1}{a_0} (2a_2 + a_1 x_0)$,

$x_2 = -\frac{1}{a_0} (3a_3 + a_2 x_0 + a_1 x_1)$,

and generally

(7-16) $x_m = -\frac{1}{a_0} \left( (m+1) a_{m+1} + a_m x_0 + a_{m-1} x_1 + \cdots + a_1 x_{m-1} \right)$, m = 1, 2, ..., N − 1.


The starting values generated by algorithm 7.4 have the following very
desirable property:


Theorem 7.4 Let the polynomial $p(z) = a_0 z^N + a_1 z^{N-1} + \cdots + a_N$
have the distinct zeros $z_1, z_2, \ldots, z_M$ ($M \le N$), and let the multiplicity
of $z_i$ be $m_i$ (i = 1, 2, ..., M). If the starting values for
Bernoulli’s method are determined by algorithm 7.4, then relation
(7-3) takes the form

(7-17) $x_n = m_1 z_1^{n+1} + m_2 z_2^{n+1} + \cdots + m_M z_M^{n+1}$, n = 0, 1, 2, ....


The proof is again most easily accomplished by complex variable theory,
and is omitted here. Relation (7-17) is remarkable for the fact that no
powers of n appear, notwithstanding the possible presence of zeros of
multiplicity higher than 1. The difficulty mentioned in §7.3 thus can
always be avoided by a proper choice of the starting values. The ratio
$x_{n+1}/x_n$ then converges at a rate determined only by the magnitude of the
two largest zeros.

EXAMPLE
4. We compare the sequences $\{q_n\}$ for the polynomial

$p(z) = (z - 3)^2 (z + 1)^2 = z^4 - 4z^3 - 2z^2 + 12z + 9$,

generated by starting Bernoulli’s method by (7-15) and by algorithm 7.4.
The recurrence relation is in both cases

$x_n = 4x_{n-1} + 2x_{n-2} - 12x_{n-3} - 9x_{n-4}$.


Table 7.4
          Method (7-15)               Algorithm 7.4
  n          x_n      q_n               x_n       q_n
 -3            0
 -2            0
 -1            0
  0            1      4                   4       5
  1            4      4.5                20       2.6
  2           18      3.78               52       3.15385
  3           68      3.69              164       2.95122
  4          251      3.54              484       3.01653
  5          888      3.46            1 460       2.99452
  6        3 076      3.40            4 372       3.00183
  7       10 456      3.35           13 124       2.99939
  8       35 061      3.32           39 364       3.00020
  9      116 252      3.29          118 100       2.99993
 10      381 974      3.26086       354 292       3.00002
 11    1 245 564      3.24000     1 062 884       2.99999
 12    4 035 631      3.22222     3 188 644       3.0000025
 13   13 003 696      3.20690     9 565 940       2.9999992
 14   41 701 512                 28 697 812
limit                 3.00000                     3.0000000

Problems 


11. Use algorithm 7.4 to obtain starting values for the application of Ber- 
noulli’s method to the polynomial of problem 10. Verify relation (7-17). 

12. Using Vieta’s formulas, verify (7-17) for n = 0 and n = 1 in the case of an
arbitrary polynomial.

13. Show that if the rational function

$r(z) = -\frac{a_1 + 2a_2 z + \cdots + N a_N z^{N-1}}{a_0 + a_1 z + a_2 z^2 + \cdots + a_N z^N}$

is expanded in powers of z, then the coefficients of $1, z, \ldots, z^{N-1}$ are
identical with the numbers $x_0, x_1, \ldots, x_{N-1}$ defined by algorithm 7.4.


7.5 Two Conjugate Complex Dominant Zeros


The theory developed so far holds equally well whether the coefficients
of the polynomial p are real or complex. However, we have always
assumed that $z_1$ is the sole dominant zero of p. Let us now consider the
case where p is a polynomial with real coefficients which has a pair of
complex conjugate dominant zeros $z_1$ and $z_2 = \bar{z}_1$, both of multiplicity
one. The remaining zeros shall satisfy

(7-18) $|z_k| < |z_1|$, k = 3, 4, ..., N;

for the purpose of our analysis we assume that the nondominant zeros,
too, have multiplicity one, although this is not essential for the result.

If the starting values for the sequence $\{x_n\}$ are real, equation (7-3) takes
the form

$x_n = c_1 z_1^n + \bar{c}_1 \bar{z}_1^n + c_3 z_3^n + \cdots + c_N z_N^n$.

Representing the complex numbers $c_1$ and $z_1$ in polar form, we write
$z_1 = r e^{i\varphi}$, $c_1 = a e^{i\delta}$, where r > 0 and a > 0. We may assume, furthermore,
that $z_1$ is the zero in the upper half-plane, and consequently
$0 < \varphi < \pi$. The expression for $x_n$ now becomes

$x_n = 2ar^n \cos(n\varphi + \delta) + c_3 z_3^n + \cdots + c_N z_N^n$.

This may be written

(7-19) $x_n = 2ar^n [\cos(n\varphi + \delta) + \theta_n]$,

where

$\theta_n = \frac{c_3}{2a} \left(\frac{z_3}{r}\right)^n + \cdots + \frac{c_N}{2a} \left(\frac{z_N}{r}\right)^n$,

[Figure 7.5. Axes: Re z, Im z.]

and hence, for some suitable constant C,
$|\theta_n| \le C t^n$,

where t denotes the largest of the ratios $|z_k/r|$, k = 3, ..., N. In view of
(7-18), the number t is less than 1, and hence
(7-20) $\lim_{n \to \infty} \theta_n = 0$.

Our problem is to recover the quantities r and $\varphi$ from the sequence
$\{x_n\}$. To find its solution, let us begin by assuming that $\theta_n = 0$. We
recall that the sequence $\{x_n\}$ with the elements

$x_n = 2ar^n \cos(n\varphi + \delta)$
is a solution of the difference equation
(7-21) $x_n + A x_{n-1} + B x_{n-2} = 0$,

where $B = r^2$, $A = -2r \cos\varphi$ (see problem 3, chapter 6). To determine
the coefficients A and B from the known solution $\{x_n\}$, we observe that
equation (7-21) together with the corresponding equation with n increased
by 1,

$x_{n+1} + A x_n + B x_{n-1} = 0$

represents a system of two linear equations for the two unknowns A and B.
The determinant

(7-22) $D_n = \begin{vmatrix} x_{n-1} & x_{n-2} \\ x_n & x_{n-1} \end{vmatrix}$

equals, by a trigonometric identity,

$4a^2 r^{2n-2} \{\cos^2[(n-1)\varphi + \delta] - \cos(n\varphi + \delta) \cos[(n-2)\varphi + \delta]\}$

$= 4a^2 r^{2n-2} \sin^2\varphi$,

and hence is different from zero, as $0 < \varphi < \pi$. We may thus solve for
A and B, finding

$A = -\frac{E_n}{D_n}$, $B = \frac{D_{n+1}}{D_n}$,
where

(7-23) $E_n = \begin{vmatrix} x_n & x_{n-2} \\ x_{n+1} & x_{n-1} \end{vmatrix}$.

The desired quantities r and $\varphi$ can now be found by

(7-24) $r = \sqrt{B} = \sqrt{\frac{D_{n+1}}{D_n}}$, $\cos\varphi = -\frac{A}{2r} = \frac{E_n}{2\sqrt{D_n D_{n+1}}}$.

The relations (7-24) solve our problem if $\{x_n\}$ is a solution of the
difference equation (7-2) such that $\theta_n = 0$ in (7-19). If $\{x_n\}$ is any solution
of (7-2), $\theta_n \ne 0$ in general; however, using (7-20) it is not too hard to show
that

(7-25) $\frac{D_{n+1}}{D_n} \to r^2$, $\frac{E_n}{2D_n} \to r \cos\varphi$

as $n \to \infty$. We thus have obtained the convergence of the following
algorithm for the determination of a pair of complex conjugate dominant
zeros.


Algorithm 7.5 From the solution $\{x_n\}$ of the difference equation
(7-2), calculate the determinants $D_n$ and $E_n$ defined by (7-22) and
(7-23). With these, form the sequence of ratios $D_{n+1}/D_n$ and
$E_n/2D_n$.

The corresponding convergence theorem is as follows:

Theorem 7.5 Let the polynomial (7-1) have exactly two dominant
zeros $z_{1,2} = r e^{\pm i\varphi}$, both of multiplicity one, where $0 < \varphi < \pi$. If the
solution of the difference equation (7-2) is such that $c_1 \ne 0$ in (7-3),
then the limit relations (7-25) hold.
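
Algorithm 7.5 can be sketched in a few lines. The following illustration applies it to the polynomial $81z^4 - 108z^3 + 24z + 20$, whose dominant zeros are $1 \pm i/3$; the function names, the starting values (7-15), and the choice n = 40 are assumptions of this sketch.

```python
import math

def bernoulli_sequence(a, steps):
    """x_0, ..., x_steps from (7-2) with x_0 = 1 and x_n = 0 for n < 0."""
    N = len(a) - 1
    x = [0.0] * (N - 1) + [1.0]
    for _ in range(steps):
        x.append(-sum(a[j] * x[-j] for j in range(1, N + 1)) / a[0])
    return x[N - 1:]

def D(x, n):
    """Determinant (7-22): x_{n-1}^2 - x_n x_{n-2}."""
    return x[n - 1] ** 2 - x[n] * x[n - 2]

def E(x, n):
    """Determinant (7-23): x_n x_{n-1} - x_{n+1} x_{n-2}."""
    return x[n] * x[n - 1] - x[n + 1] * x[n - 2]

a = [81, -108, 0, 24, 20]
x = bernoulli_sequence(a, 42)
n = 40
r_squared = D(x, n + 1) / D(x, n)           # tends to r^2 = 10/9
r_cos_phi = E(x, n) / (2 * D(x, n))         # tends to r cos(phi) = 1
r = math.sqrt(r_squared)
phi = math.acos(r_cos_phi / r)              # argument of the upper zero
```

The recovered pair (r, phi) corresponds to the zero $r e^{i\varphi} = 1 + i/3$ in the upper half-plane.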
EXAMPLE
5. For the polynomial
$p(z) = 81z^4 - 108z^3 + 24z + 20$
(exact dominant zeros: $z_{1,2} = 1 \pm i/3$) the computation proceeds as in
table 7.5. It is evidently unnecessary to calculate the determinants $D_n$
and $E_n$ from the beginning.

Algorithm 7.5 has the disadvantage of not being very accurate when $\varphi$,
the argument of the dominant zero, is small. Let us assume that we have
determined r accurately and $x = r \cos\varphi = \lim E_n/2D_n$ with an error dx.
We then have to calculate $\varphi$ from the relation

$\cos\varphi = \frac{x}{r}$,

and find for the differential of $\varphi$

$d\varphi = -\frac{1}{r \sin\varphi}\, dx$.

For small values of $\varphi$ this may be quite large even for a small value of dx.


Problems 
14. Determine the dominant zeros of the polynomial 
P(z) = z2 — 27 +2 
to three decimal places.

Table 7.5
  n      x_n          D_n        D_{n+1}/D_n    E_n         E_n/2D_n
  0    1
  1    1.333333
  2    1.777778
  3    2.074074
  4    2.123457
  5    1.975309
  6    1.580247
  7    0.965707
  8    0.178022
  9   -0.718589
 10   -1.634438
 11   -2.470444
 12   -3.124966
 13   -3.504914
 14   -3.537670    1.229324    1.111187       2.458740    1.000037
 15   -3.180991    1.366009    1.111125       2.732037    1.000006
 16   -2.431232    1.517807    1.111095       3.035585    0.999990
 17   -1.328033    1.686428    1.111115       3.372868    1.000003
 18    0.045304    1.873816    1.111111       3.747630    0.999999
 19    1.566200    2.082018    1.111113       4.164035    1.000000
 20                2.313358                   4.626705
limit                          1.111111                   1.000000

15. Starting from the exact representation (7-19) of x,, prove the relations 
(7-25). 

16. Devise a modification of Bernoulli’s method for the case that the poly- 
nomial p has two real dominant zeros, both of multiplicity one, of opposite 
signs, and apply your algorithm to the polynomial 


$p(z) = z^4 - 2z^3 - 2z^2 + 8z - 8$.


7.6 Sign Waves†


† This section may be omitted at first reading.

How can we detect a pair of conjugate complex zeros? Formula
(7-19) shows that in the presence of a pair of conjugate complex dominant
zeros the elements of the sequence $\{x_n\}$ behave like

(7-26) $x_n = 2ar^n \cos(n\varphi + \delta)$.

This equation implies, in particular, that the signs of the $x_n$ oscillate.
Moreover, by looking at the frequency with which these oscillations occur,
we can hope to be able to make a statement about the angle $\varphi$. Clearly,
when $\varphi$ is small, the “sign waves” are long, and when $\varphi$ is close to $\pi$, they
are short. (In the extreme case, where $\varphi = \pi$, plus and minus signs
alternate in strict order.)

In order to establish a more precise result, let us assume that the
polynomial p has degree 2. Equation (7-26) then is exact; we now write
it in the form

(7-27) $x_n = 2ar^n \sin\left(n\varphi + \delta + \frac{\pi}{2}\right)$.

If n were a continuous variable, the signs of $x_n$ would change from minus
to plus whenever

$n\varphi + \delta + \frac{\pi}{2} = 2k\pi$,

where k is any integer. Let now, for any positive integer k, $n_k$ be the first
integer following the kth change of sign from minus to plus of the sine
function in (7-27), i.e., let

(7-28) $n_k = \frac{2k\pi - \delta - \frac{\pi}{2}}{\varphi} + \theta_k$,

where $0 \le \theta_k < 1$. This means that for each positive integer k,
sign $x_{n_k - 1} = -1$, sign $x_{n_k} = 0$ or 1.

Now subtract from (7-28) the corresponding relation for k = 0. We
find, after dividing by k,

$\frac{n_k - n_0}{k} = \frac{2\pi}{\varphi} + \frac{\theta_k - \theta_0}{k}$.

We now consider the limit of this expression as $k \to \infty$. Since the $\theta_k$ are
numbers between zero and one,

$\lim_{k \to \infty} \frac{\theta_k - \theta_0}{k} = 0$.

It follows that

$T = \lim_{k \to \infty} \frac{n_k - n_0}{k}$

exists and has the value $2\pi/\varphi$. Solving for $\varphi$, we find

(7-29) $\varphi = \frac{2\pi}{T}$.

The limit T appearing here has a very simple interpretation. The
integer $n_k - n_{k-1}$ represents the length or period of the kth sign wave in
the sequence of signs of the $x_n$. By virtue of the identity

$n_k - n_0 = (n_k - n_{k-1}) + (n_{k-1} - n_{k-2}) + \cdots + (n_1 - n_0)$,

the expression $n_k - n_0$ is the sum of the periods of all sign waves from the
first to the kth. Hence T is just the average period of all sign waves.

We thus have obtained the following algorithm for the determination of
the arguments of a pair of complex conjugate dominant zeros.


Algorithm 7.6 In the sequence $\{x_n\}$ generated by algorithm 7.1,
delete everything but the signs† of the $x_n$. For k = 0, 1, 2, ..., let
$n_k$ be the index of the kth + which is preceded by a −, and calculate
the average value of all sign wave periods $n_k - n_{k-1}$.


Although our analysis applies only to polynomials of degree 2, the 
following result can be proved for polynomials of arbitrary degree 
(Brown [1962]). 


Theorem 7.6 Under the hypotheses of theorem 7.5, the argument $\varphi$ is
given by (7-29), where T denotes the average sign wave period.


In electrical engineering, the quantity $2\pi/T$ is called the circular frequency
of an alternating current with period T. In this sense we can say that $\varphi$ is
equal to the circular frequency of the sequence of signs of the $x_n$.

EXAMPLE 
6. For the polynomial 


$p(z) = z^4 - 8z^3 + 39z^2 - 62z + 50$
$= (z^2 - 2z + 2)(z^2 - 6z + 25)$


the sequence of signs of the x, turned out as follows: 


aoe Seer oe Ane a Se Se oa 
ee eae mea ea, 
ete ae ee ee era eat 
6 7 ca ee mee 


The average length of the periods of the full sign waves recorded above is
6.75. We thus find the approximate value

$\varphi \approx \frac{2\pi}{6.75} \approx 53.3°$.

The exact value is

$\varphi = \arctan\frac{4}{3} \approx 53.1°$.


† A zero may be considered positive or negative.

It is only fair to point out that, in spite of the favorable showing in
example 6, the convergence of algorithm 7.6 is rather slow in general.
The algorithm is not useful for purposes of exact numerical computation,
but it does give very quickly, with almost no computation, an idea about
the location of a pair of complex conjugate zeros.


Problems 


17. Bernoulli’s method is applied unsuspectingly to the polynomial
$p(z) = z^4 + 2z^3 + 4z^2 - 2z - 5$.

Show that the polynomial has a pair of complex conjugate dominant
zeros. In which quadrants of the plane do they lie?

18. Algorithm 7.6 fails for the polynomial

$p(z) = z^3 - 2z^2 + 2z - 1$.

Explain! [Hint: How many dominant zeros are there?]

19. Devise an algorithm for removing a quadratic factor $z^2 - uz - v$ from a
given polynomial. [Hint: Write $p(z) = (z^2 - uz - v) q(z)$ and compare
coefficients of like powers of z, as in §4.10.]

20. Verify that $z^2 - 3.711245z + 3.728699$ is (approximately) a quadratic
factor of the polynomial

$p(z) = z^5 - 3z^4 - 20z^3 + 60z^2 - z - 78$

and remove it from p by the algorithm devised in problem 19.


Recommended Reading 


Bernoulli’s method, and some extensions of it, are dealt with by Aitken 
[1926]; other accounts are given in Householder [1953] and Hildebrand 
[1956], pp. 458-462. An entirely different procedure for determining the 
dominant zeros of a polynomial is known as Graeffe’s method. It is 
dealt with in the above standard numerical analysis texts; see also 
Ostrowski [1940] and Bareiss [1960]. A complete account of modern 
automatic procedures for polynomials is given by Wilkinson [1959]. 


Research Problems 


1. How does the presence of more than two dominant zeros manifest
itself in the sequence of signs of the $x_n$, and how can the arguments be
recovered?

2. Suppose a polynomial has exactly three dominant zeros, one real, two 
complex conjugate. Devise a modification of Bernoulli’s method that 
deals with this situation. 
Bernoulli’s method has the disadvantage of furnishing only the dominant
zeros of a polynomial. If it is desired to compute a non-dominant zero by
Bernoulli’s method, it is necessary first to compute all larger zeros, and
then to remove them from the polynomial by the method of §4.10. Only
rarely will these zeros be known exactly. Thus the successive deflations
will have a tendency to falsify the remaining zeros. We now shall discuss 
a modern extension of Bernoulli’s method, due to Rutishauser, which has 
the advantage of providing simultaneous approximations to all zeros. 
Since the prerequisites for this volume do not include complex function 
theory, we are unable to provide the proofs for the convergence theorems 
in this chapter. But even though its theoretical background cannot be 
fully exposed, we feel that Rutishauser’s algorithm is of sufficient interest 
to warrant its presentation at this point. 


8.1 The Quotient-Difference Scheme 


The Quotient-Difference (QD) algorithm can be looked at as a general- 
ization of Bernoulli’s method. As in chapter 7, we are given a polynomial 


(8-1)    $p(z) = a_0 z^N + a_1 z^{N-1} + \cdots + a_N$

and form a solution of the associated difference equation
(8-2)    $a_0 x_n + a_1 x_{n-1} + \cdots + a_N x_{n-N} = 0$.

The sequence $\{x_n\}$ may for instance be started by setting
(8-3)    $x_{-N+1} = x_{-N+2} = \cdots = x_{-1} = 0, \quad x_0 = 1$.


In chapter 7 we now formed the quotients


(8-4)    $q_n = \dfrac{x_{n+1}}{x_n}$.
If the polynomial p has a single dominant zero, then, as was shown in
chapter 7, the sequence $\{q_n\}$ converges to it.

The elements of the sequence $\{q_n\}$ will now be denoted by $q_n^{(1)}$; they form
the first column of the two-dimensional scheme called the Quotient-Difference
(QD) scheme. The elements of the remaining columns are
conventionally denoted by $e_n^{(1)}, q_n^{(2)}, e_n^{(2)}, q_n^{(3)}, \ldots, e_n^{(N-1)}, q_n^{(N)}$ and are
generated by alternately forming differences and quotients, as follows:


(8-5a)    $e_n^{(k)} = (q_{n+1}^{(k)} - q_n^{(k)}) + e_{n+1}^{(k-1)}$,
(8-5b)    $q_n^{(k+1)} = \dfrac{e_{n+1}^{(k)}}{e_n^{(k)}}\, q_{n+1}^{(k)}$,


where k = 1, 2, ..., N - 1, n = 0, 1, 2, .... In (8-5a) we set $e_{n+1}^{(0)} = 0$
when k = 1. The number of q columns formed is equal to the degree of
the given polynomial.


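EXAMPLE
1. For N = 4, the general QD scheme looks as follows:

         q_0^(1)
  0                e_0^(1)
         q_1^(1)            q_0^(2)
  0                e_1^(1)            e_0^(2)
         q_2^(1)            q_1^(2)            q_0^(3)
  0                e_2^(1)            e_1^(2)            e_0^(3)
         q_3^(1)            q_2^(2)            q_1^(3)            q_0^(4)
  0                e_3^(1)            e_2^(2)            e_1^(3)
         q_4^(1)            q_3^(2)            q_2^(3)            q_1^(4)
  :                e_4^(1)     :      e_3^(2)     :      e_2^(3)     :

                            Scheme 8.1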


In each column of the scheme the superscripts are constant, and in each 
diagonal the subscripts. The rules (8-5) can be memorized by observing 
that in each of the rhombus-like configurations shown in the scheme 
either the sums or the products of the SW and of the NE pair of elements 
are equal. If a rhombus is centered in a q column, sums are equal; if it is
centered in an e column, products are equal. In view of this interpretation
the formulas (8-5) are occasionally referred to as the rhombus rules. 


The QD scheme can be described in yet another way if we introduce, in
addition to the forward difference operator $\Delta$ already introduced in §4.4,
the quotient operator Q defined by


$Q x_n = \dfrac{x_{n+1}}{x_n}$.

The relations (8-5) then can be written more compactly thus:
(8-6)    $e_n^{(k)} = e_{n+1}^{(k-1)} + \Delta q_n^{(k)}, \qquad q_n^{(k+1)} = q_{n+1}^{(k)}\, Q e_n^{(k)}$.
Here it must be understood that the operators $\Delta$ and Q act on the subscript.


EXAMPLE 
2. We conclude this section with the numerical QD scheme for the 
polynomial 

$p(z) = z^2 - z - 1$.


The sequence {x,} is the Fibonacci sequence (see example 7, §6.3). 


Table 8.1

  x_n  e_n^(0)     q_n^(1)     e_n^(1)     q_n^(2)     e_n^(2)
    1        0
                  1.000000
    1        0                1.000000
                  2.000000               -1.000000
    2        0               -0.500000               -0.000001
                  1.500000               -0.500001
    3        0                0.166667               -0.000001
                  1.666667               -0.666669
    5        0               -0.066667                0.000002
                  1.600000               -0.600000
    8        0                0.025000                0.000025
                  1.625000               -0.624975
   13        0               -0.009615               -0.000049
                  1.615385               -0.615409
   21        0                0.003663               -0.000171
                  1.619048               -0.619243
   34        0               -0.001401
                  1.617647
   55        0
                      ↓                       ↓
                (1 + √5)/2                    ?
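
The column-by-column construction of §8.1 can be sketched in a few lines of Python. This is only an illustration of the rhombus rules (8-5); the function name `qd_columns` and the returned dictionary layout are assumptions made for this sketch, and ordinary floating-point arithmetic is used instead of six-decimal rounding, so the last digits differ slightly from Table 8.1.

```python
def qd_columns(x, num_q_columns):
    """Build QD columns from a sequence x by the rhombus rules (8-5).

    Returns {"q": [q^(1), q^(2), ...], "e": [e^(1), e^(2), ...]},
    each column being a list indexed by n = 0, 1, 2, ...
    """
    q = [x[n + 1] / x[n] for n in range(len(x) - 1)]   # (8-4): q_n^(1)
    e_prev = [0.0] * len(x)                            # e_n^(0) = 0
    q_cols, e_cols = [q], []
    for _ in range(num_q_columns - 1):
        # (8-5a): e_n^(k) = (q_{n+1}^(k) - q_n^(k)) + e_{n+1}^(k-1)
        e = [(q[n + 1] - q[n]) + e_prev[n + 1] for n in range(len(q) - 1)]
        # (8-5b): q_n^(k+1) = (e_{n+1}^(k) / e_n^(k)) * q_{n+1}^(k)
        q = [e[n + 1] / e[n] * q[n + 1] for n in range(len(e) - 1)]
        q_cols.append(q)
        e_cols.append(e)
        e_prev = e
    return {"q": q_cols, "e": e_cols}


# Fibonacci sequence of example 2, p(z) = z^2 - z - 1
fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
scheme = qd_columns(fibonacci, 2)
for name, column in (("q1", scheme["q"][0]), ("e1", scheme["e"][0]),
                     ("q2", scheme["q"][1])):
    print(name, [round(v, 6) for v in column])
```

Each new column is shorter than the one before it, which is why a long sequence $\{x_n\}$ is needed to obtain many entries in the later columns.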
Problems
1. Generate the QD scheme for the polynomial $p(z) = z^3 + 5z^2 + 9z + 5$.
2. Let $x_n = n!$, n = 0, 1, 2, .... Determine the QD scheme corresponding
to the sequence $\{x_n\}$ (a) numerically, (b) analytically. (The sequence $\{x_n\}$
does not arise as solution of a difference equation in this case.)
3. Give analytical formulas for the entries of the QD scheme, if $x_n = 1 + q^n$,
where 0 < |q| < 1. Show that

$\lim_{n\to\infty} q_n^{(1)} = 1, \qquad \lim_{n\to\infty} e_n^{(1)} = 0, \qquad \lim_{n\to\infty} q_n^{(2)} = q$,

and that $e_n^{(2)} = 0$ for all n.
8.2 Existence of the QD Scheme


Evidently the QD scheme fails to exist if a coefficient $e_n^{(k)}$ with 0 < k < N
becomes zero, and it is easy to construct examples for which this actually
occurs. Another, trivial, case of nonexistence of the scheme arises when an
$x_n$ becomes accidentally zero. It appears to be difficult to state explicit
necessary and sufficient conditions for the existence of the scheme in terms
of the polynomial p. In terms of the sequence $\{x_n\}$, a necessary and
sufficient condition is that the determinants


(8-7)    $H_n^{(k)} = \begin{vmatrix} x_n & x_{n+1} & \cdots & x_{n+k-1} \\ x_{n+1} & x_{n+2} & \cdots & x_{n+k} \\ \vdots & \vdots & & \vdots \\ x_{n+k-1} & x_{n+k} & \cdots & x_{n+2k-2} \end{vmatrix}$


should be different from zero for k = 1, 2, ..., N and for n = 0, 1, 2, ....
It is possible to state simple sufficient (but not necessary) conditions for this
to be the case. Among them are the following:

(i) The zeros $z_1, z_2, \ldots, z_N$ of p are positive, and the sequence $\{x_n\}$ is
started by algorithm 7.4.

(ii) The zeros $z_1, z_2, \ldots, z_N$ of p are simple (but not necessarily real) and
have distinct absolute values:


(8-8)    $|z_1| > |z_2| > \cdots > |z_N| > 0$.


In case (ii) we can assert only that $H_n^{(k)} \neq 0$, and consequently $e_n^{(k)} \neq 0$,
for all sufficiently large values of n.

There is a good deal of numerical evidence that the QD scheme exists in
many cases even if neither of the above sufficient conditions is satisfied,
for instance if p is a polynomial with real coefficients having pairs of 
complex conjugate zeros. 
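
For a concrete sequence the determinants (8-7) are easy to evaluate. The following sketch computes them exactly for integer or rational data; the helper names `det` and `hankel` are assumptions of this sketch, and the elimination routine is a generic one, not a method taken from the text.

```python
from fractions import Fraction


def det(matrix):
    """Determinant by exact (rational) Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in matrix]
    size = len(m)
    result = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)          # a zero column: singular matrix
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result            # a row swap changes the sign
        result *= m[c][c]
        for r in range(c + 1, size):
            factor = m[r][c] / m[c][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    return result


def hankel(x, n, k):
    """H_n^(k) of (8-7): the k-by-k determinant with entries x_{n+i+j}."""
    return det([[x[n + i + j] for j in range(k)] for i in range(k)])


fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print([hankel(fibonacci, n, 2) for n in range(4)])   # alternates between 1 and -1
print([hankel(fibonacci, n, 3) for n in range(4)])   # all zero: k exceeds N = 2
```

For the Fibonacci sequence (N = 2) the determinants with k = 1 and k = 2 never vanish, in agreement with the existence of the scheme in Table 8.1.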


Problems 
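4. Show that


$q_n^{(1)} = \dfrac{H_{n+1}^{(1)}}{H_n^{(1)}}, \qquad e_n^{(1)} = \dfrac{H_n^{(2)}}{H_n^{(1)} H_{n+1}^{(1)}}, \qquad q_n^{(2)} = \dfrac{H_{n+1}^{(2)} H_n^{(1)}}{H_n^{(2)} H_{n+1}^{(1)}}$.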





5. Show that the determinants $H_n^{(1)}$ and $H_n^{(2)}$ formed with the elements
$x_n = 1 + q^n$, where 0 < |q| < 1, are always different from zero.

6. Let $x_n = r^n \cos(n\varphi + \delta)$, where $0 < \varphi < \pi$. Show that the corresponding
determinants $H_n^{(2)}$ are always different from zero. What are
the conditions on $\varphi$ and $\delta$ in order that $H_n^{(1)} \neq 0$ for all n?
8.3 Convergence Theorems


If the QD scheme exists, some remarkable statements are possible
about the limits of its elements as $n \to \infty$. The simplest situation arises
if the zeros of the polynomial p satisfy (8-8). We then have


Theorem 8.3a Under the conditions just stated,


(8-9)    $\lim_{n\to\infty} q_n^{(k)} = z_k, \qquad k = 1, 2, \ldots, N$,


i.e., the kth q-column of the QD scheme converges to the kth zero of
the polynomial.


It follows from (8-9) by virtue of (8-5a) that


$\lim_{n\to\infty} e_n^{(1)} = 0$,

and from this we get easily by induction


(8-10)    $\lim_{n\to\infty} e_n^{(k)} = 0, \qquad k = 1, 2, \ldots, N - 1$.


Thus, under the condition (8-8), all e-columns of the QD scheme tend to
zero.


EXAMPLE
3. In table 8.1 the column headed $q^{(2)}$ tends to $(1 - \sqrt{5})/2$, the smaller
zero of $p(z) = z^2 - z - 1$. (Concerning the column $e^{(2)}$, see §8.4.)


If several zeros of p have the same absolute value (this happens, for 
instance, every time when a polynomial with real coefficients has a pair of 
complex conjugate zeros), the convergence properties of the scheme are 
more complicated. We still assume that the zeros are numbered such that 


(8-11)    $|z_1| \geq |z_2| \geq |z_3| \geq \cdots \geq |z_N| > 0$.


For convenience in formulating some of the conditions below, we shall put
$|z_0| = \infty$, $|z_{N+1}| = 0$.
Always assuming that the scheme exists, we then have

Theorem 8.3b For every k such that $|z_{k-1}| > |z_k| > |z_{k+1}|$,
(8-12)    $\lim_{n\to\infty} q_n^{(k)} = z_k$.

For every k such that $|z_k| > |z_{k+1}|$,
(8-13)    $\lim_{n\to\infty} e_n^{(k)} = 0$.

These facts can be used in the following manner: The e columns which
tend to zero (a behavior which is numerically conspicuous) divide the QD
table into subtables. All zeros $z_k$ whose subscripts agree with the
superscripts of the q’s in one subtable have the same modulus. Thus, if a $z_k$ is
the only zero of its modulus, this will be evident from the fact that the
corresponding subtable contains one q column only, and the value of $z_k$
can be obtained as the limit of that q column.

It is not yet clear how to deal with several zeros having the same
absolute value. (Most frequently this situation occurs in connection with
complex conjugate zeros of real polynomials.) Such zeros, too, can be
obtained from the QD table. We first consider the general case where m
zeros $z_{k+1}, z_{k+2}, \ldots, z_{k+m}$ have the same modulus:


(8-14)    $|z_k| > |z_{k+1}| = |z_{k+2}| = \cdots = |z_{k+m}| > |z_{k+m+1}|$.

Here it is necessary to construct polynomials $p_n^{(l)}$, l = k, k + 1, ..., k + m,
by means of the recurrence relations

(8-15a)    $p_n^{(k)}(z) = 1, \qquad n = 0, 1, 2, \ldots$,

(8-15b)    $p_n^{(l+1)}(z) = z\, p_{n+1}^{(l)}(z) - q_n^{(l+1)}\, p_n^{(l)}(z)$,

           l = k, k + 1, ..., k + m - 1; n = 0, 1, 2, ....


These polynomials can again be thought of as being arranged in a two-
dimensional array. Scheme 8.3 shows a segment of this array for m = 2.


  p_n^(k) = 1
                       p_n^(k+1)(z)
  p_{n+1}^(k) = 1                          p_n^(k+2)(z)
                       p_{n+1}^(k+1)(z)
  p_{n+2}^(k) = 1

                       Scheme 8.3


The zeros $z_{k+1}, z_{k+2}, \ldots, z_{k+m}$ can now be obtained from the polynomials
$p_n^{(k+m)}$ by virtue of the following theorem:


Theorem 8.3c If the zeros satisfy (8-14), then for each fixed z
(8-16)    $\lim_{n\to\infty} p_n^{(k+m)}(z) = (z - z_{k+1})(z - z_{k+2}) \cdots (z - z_{k+m})$,


i.e., the coefficients of the polynomials $p_n^{(k+m)}$ tend for $n \to \infty$ to the
coefficients of the polynomial with zeros $z_{k+1}, \ldots, z_{k+m}$ and leading
coefficient 1.

For m = 1 theorem 8.3c reduces to relation (8-12). For m = 2 (the
practically most frequent case) the converging polynomials are given by


$p_n^{(k+2)}(z) = z[z - q_{n+1}^{(k+1)}] - q_n^{(k+2)}[z - q_n^{(k+1)}]$
$\qquad\qquad = z^2 - (q_{n+1}^{(k+1)} + q_n^{(k+2)})z + q_n^{(k+1)} q_n^{(k+2)}$.


Relation (8-16) here means that the limits


(8-17a)    $\lim_{n\to\infty} (q_{n+1}^{(k+1)} + q_n^{(k+2)}) = A_k$,
(8-17b)    $\lim_{n\to\infty} q_n^{(k+1)} q_n^{(k+2)} = B_k$,


exist, and that the polynomial $z^2 - A_k z + B_k$ has the zeros $z_{k+1}$ and
$z_{k+2}$.

We finally mention the following fact, which could (and will) play the 
role of a computational check. 


Theorem 8.3d The quantities $e_n^{(N)}$ calculated from (8-5a) with k = N
are identically zero.


Examples illustrating the above theorems will be given in §8.5. 


Problems 


7. Prove theorem 8.3a in the case of a polynomial of degree N = 2 whose
zeros satisfy $|z_1| > |z_2| > 0$.
8. Show that 


Dr+3 (1) En+2 
Gace on ? 


gig? = Gi. +P = - 
n+2 


Dna+e 
where D, and £, denote the determinants defined by (7-22) and (7-23). 
Thus conclude that in the case z; = Z, theorem 8.3c is equivalent to 
theorem 7.5. 

9. Obtain the two dominant complex conjugate zeros of the polynomial


$p(z) = 81z^4 - 108z^3 + 24z + 20$


from the QD scheme by using the relations (8-17) with k = 0.
10. Assuming that the quantities $x_n$ are given by (7-17), prove theorem 8.3d
for polynomials of degree N = 1 and N = 2.


8.4 Numerical Instability 


As described above, the QD scheme is built up proceeding from the left
to the right. The sequence $\{x_n\}$ determines the first q column $\{q_n^{(1)}\}$; from
it we obtain in succession the columns $\{e_n^{(1)}\}$, $\{q_n^{(2)}\}$, ..., by means of the
relations (8-5). The reader will be shocked to learn that this method of
generating the QD scheme is not feasible in practice, because it suffers
from numerical instability due to severe loss of significant digits.

The concept of numerical instability was already briefly alluded to in 
§1.5. In mathematical analysis, real (or complex) numbers are always 
conceived as being determined with infinite accuracy. Mathematically, 
this infinite accuracy is expressed by the Dedekind cut property (see 
Taylor [1959], p. 447; Buck [1956], p. 388). Numerically, the real 
numbers of mathematical analysis can be thought of as infinite decimal 
fractions. In computation a real number z is, with rare exceptions,
never represented exactly, but rather approximated by some rational
number z*, e.g., by a decimal fraction with eight decimals.† The quantity
z* - z is called rounding error, or sometimes also absolute rounding error,
in order to distinguish it from the relative rounding error $|z^* - z|/|z|$.
The following simple rules concerning the propagation of rounding errors
follow easily from first principles:

(i) A sum or difference of two rounded numbers a* and b* has an
absolute rounding error of the order of the sum of the absolute errors of
a* and b*.

(ii) A product or quotient of two rounded numbers a* and b* has a 
relative error of the order of the sum of the relative errors of a* and b*. 

The QD scheme offers an interesting illustration of these rules. To
simplify matters, let us assume that the sequence $\{x_n\}$ is generated without
rounding error (this is possible if p is a polynomial with integer coefficients
and leading coefficient 1), and that the situation covered by theorem 8.3a
obtains. If we carry t decimals, the numbers $q_n^{(1)}$ then will have (absolute)
rounding errors of the order of $10^{-t}$. The elements $e_n^{(1)}$, formed by
differencing the $q_n^{(1)}$, will by rule (i) have absolute rounding errors of the
same magnitude. However, since by (8-10) the $e_n^{(1)}$ tend to zero, their
relative errors become larger and larger. Thus, by rule (ii), the ratios
$e_{n+1}^{(1)}/e_n^{(1)}$ are formed less and less accurately, and, again by rule (ii), the
relative error of the quantity $q_n^{(2)}$, as determined by (8-5b), increases
without bound. Since $q_n^{(2)}$ tends to a nonzero limit as $n \to \infty$, the same is
true for the absolute error of $q_n^{(2)}$. It is clear that this numerical instability
becomes even more pronounced as k increases.


EXAMPLE 

4. The loss of significant digits is illustrated already in example 2. By
theorem 8.3d, the column $e^{(2)}$ should theoretically consist of zeros. The
fact that these elements are not zero, and even increase with increasing n,
shows the growing influence of rounding errors. The reader is asked to
calculate the same example using numbers with ten decimals. This will
delay, but not ultimately prevent, the phenomenon of numerical instability.


† For more details the reader is referred to chapter 15. At the present stage we aim
at a qualitative rather than quantitative understanding of rounding errors.


The fact that the method of generating the QD scheme described in §8.1 
is unstable does not, of course, prevent the theorems stated in §8.3 from 
being true. These theorems concern the mathematically exact, unrounded
QD scheme. They can be used numerically as soon as we succeed in
generating the QD scheme in a numerically stable manner. One method 
of avoiding rounding errors, applicable to polynomials with rational 
coefficients, would be to perform all operations in exact rational arithmetic 
(see Henrici [1956]). Fortunately, as will be seen in the following section, 
there is a simple way of generating a stable QD scheme also in conventional 
arithmetic. 
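
The contrast between the exact and the rounded scheme can be made visible with a small experiment. The sketch below builds the column $e^{(2)}$ for the Fibonacci sequence of example 2 twice: once in exact rational arithmetic and once with every computed entry rounded to six decimals. The rounding model and the name `second_e_column` are assumptions of this sketch; the rounding actually used for Table 8.1 may differ in detail.

```python
from fractions import Fraction


def second_e_column(x, rnd):
    """e_n^(2), built column by column with (8-5); rnd models the working precision."""
    q1 = [rnd(x[n + 1] / x[n]) for n in range(len(x) - 1)]
    e1 = [rnd(q1[n + 1] - q1[n]) for n in range(len(q1) - 1)]
    q2 = [rnd(e1[n + 1] / e1[n] * q1[n + 1]) for n in range(len(e1) - 1)]
    return [rnd(q2[n + 1] - q2[n] + e1[n + 1]) for n in range(len(q2) - 1)]


def six_decimals(v):
    """Round a rational number to six decimals (illustrative rounding model)."""
    return Fraction(round(v * 10**6), 10**6)


fibonacci = [Fraction(v) for v in (1, 1, 2, 3, 5, 8, 13, 21, 34, 55)]
exact = second_e_column(fibonacci, lambda v: v)
rounded = second_e_column(fibonacci, six_decimals)
print([float(v) for v in exact])     # identically zero, as theorem 8.3d predicts
print([float(v) for v in rounded])   # nonzero entries caused by rounding alone
```

In exact arithmetic the column vanishes identically; every nonzero entry in the rounded run is therefore pure rounding error.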


8.5 Progressive Form of the Algorithm 


The QD scheme can be generated in a stable manner if it is built up 
row by row instead of column by column, as in §8.1. To see this, we 
solve each of the recurrence relations (8-5) for the south element of the 
rhombus involved: 


(8-18a)    $q_{n+1}^{(k)} = (e_n^{(k)} - e_{n+1}^{(k-1)}) + q_n^{(k)}$,
(8-18b)    $e_{n+1}^{(k)} = \dfrac{q_n^{(k+1)}}{q_{n+1}^{(k)}}\, e_n^{(k)}$.


Let us assume that a row of q’s and a row of e’s is known, each affected by
"normal" rounding errors of several digits in the last place. The new
row of q’s, as calculated from (8-18a), will then, by rule (i), have absolute
errors of the same magnitude. The fact that the relative errors in the e’s
are large (due to the smallness of the e’s) is not important now. Furthermore,
the relative errors in the new row of e’s determined by (8-18b) are
of the same order as in the old row of e’s, due to rule (ii). While a normal
amount of error propagation must be expected also in the present mode 
of generating the scheme, it is much less serious than when the scheme is 
generated column by column. 

If the scheme is to be generated row by row, a first couple of rows must 
somehow be obtained. The following algorithm shows how this is 
accomplished. 


Algorithm 8.5 Let $a_0, a_1, \ldots, a_N$ be constants, all different from zero.
Set


(8-19a)    $q_0^{(1)} = -\dfrac{a_1}{a_0}, \qquad q_{1-k}^{(k)} = 0, \quad k = 2, 3, \ldots, N$;

(8-19b)    $e_{1-k}^{(k)} = \dfrac{a_{k+1}}{a_k}, \qquad k = 1, 2, \ldots, N - 1$.

Consider the elements thus generated as the first two rows of a QD
scheme, and generate further rows by means of (8-18), using the side
conditions


(8-20)    $e_n^{(0)} = e_n^{(N)} = 0$ for all n occurring in the scheme.


As before, there is the theoretical possibility of breakdown of the scheme 
due to the fact that a denominator is zero. However, we have 


Theorem 8.5 If the scheme of the elements $q_n^{(k)}$ and $e_n^{(k)}$ defined by
algorithm 8.5 exists, it is (for $n \geq 0$) identical with the QD scheme of
the polynomial


$p(z) = a_0 z^N + a_1 z^{N-1} + \cdots + a_N$,
where the sequence $\{x_n\}$ is started in the manner (8-3).


The proof of theorem 8.5 requires some involved algebra and it is
omitted here. A necessary and sufficient condition for the existence of
the scheme defined by algorithm 8.5 is that the determinants (8-7) be
different from zero also for n > -k. (Here we have to interpret $x_n = 0$
for n < 0.)
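
Algorithm 8.5 translates almost directly into code. The following Python sketch stores one row of q’s and one row of e’s at a time; the function name `qd_progressive` and the list layout (row r holding $q_r^{(1)}, q_{r-1}^{(2)}, \ldots, q_{r-N+1}^{(N)}$) are choices made for this illustration only. No safeguard against a vanishing denominator is included.

```python
def qd_progressive(a, num_rows):
    """q rows of the QD scheme of p(z) = a[0] z^N + ... + a[N] by algorithm 8.5.

    All coefficients a[k] are assumed to be nonzero.
    """
    N = len(a) - 1
    q = [-a[1] / a[0]] + [0.0] * (N - 1)                         # (8-19a)
    # e[k] holds e^(k); e[0] and e[N] are the side conditions (8-20)
    e = [0.0] + [a[k + 1] / a[k] for k in range(1, N)] + [0.0]   # (8-19b)
    rows = [q]
    for _ in range(num_rows - 1):
        # (8-18a): new q^(k) = (e^(k) - e^(k-1)) + old q^(k); q[k-1] holds q^(k)
        q = [e[k] - e[k - 1] + q[k - 1] for k in range(1, N + 1)]
        # (8-18b): new e^(k) = (new q^(k+1) / new q^(k)) * old e^(k)
        e = [0.0] + [q[k] / q[k - 1] * e[k] for k in range(1, N)] + [0.0]
        rows.append(q)
    return rows


# Polynomial of example 6: 128 z^4 - 256 z^3 + 160 z^2 - 32 z + 1
for row in qd_progressive([128, -256, 160, -32, 1], 10):
    print(["%.6f" % v for v in row])
```

Only additions, subtractions and one division per entry are needed, and the sequence $\{x_n\}$ never has to be formed.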


EXAMPLES
5. The top rows of the scheme for N = 4 look as follows:


     -a_1/a_0            0               0               0
  0          a_2/a_1         a_3/a_2         a_4/a_3
     q_1^(1)         q_0^(2)         q_{-1}^(3)      q_{-2}^(4)
  0          e_1^(1)         e_0^(2)         e_{-1}^(3)
     q_2^(1)         q_1^(2)         q_0^(3)         q_{-1}^(4)
  0          e_2^(1)         e_1^(2)         e_0^(3)

                          Scheme 8.5


6. For the polynomial
$p(z) = 128z^4 - 256z^3 + 160z^2 - 32z + 1$


we obtain the scheme shown in table 8.5a.

Table 8.5a

      q^(1)      e^(1)      q^(2)      e^(2)      q^(3)      e^(3)      q^(4)
   2.000000               .000000               .000000               .000000
              -.625000              -.200000              -.031250
   1.375000               .425000               .168750               .031250
              -.193182              -.079412              -.005787
   1.181818               .538770               .242375               .037037
              -.088068              -.035725              -.000884
   1.093750               .591114               .277215               .037921
              -.047596              -.016754              -.000121
   1.046154               .621956               .293848               .038042
              -.028297              -.007915              -.000016
   1.017857               .642337               .301748               .038058
              -.017857              -.003718              -.000002
   1.000000               .656476               .305464               .038060
              -.011723              -.001730              -.000000
    .988277               .666468               .307194               .038060
              -.007906              -.000798              -.000000
    .980372               .673576               .307992               .038060
              -.005432              -.000365              -.000000
    .974940               .678643               .308356               .038060
              -.003781              -.000166              -.000000


All e columns tend to zero, thus by theorem 8.3a the polynomial has four
zeros whose absolute values are different. The q columns converge to
the zeros, whose exact values are as follows:


$z_1 = 0.96194, \quad z_2 = 0.69134, \quad z_3 = 0.30866, \quad z_4 = 0.03806$.
7. For the polynomial
$p(z) = z^4 - 8z^3 + 39z^2 - 62z + 50$


algorithm 8.5 yields the following QD scheme:


Table 8.5b

       q^(1)       e^(1)       q^(2)       e^(2)       q^(3)       e^(3)       q^(4)
    8.000000                 .000000                 .000000                 .000000
               -4.875000               -1.589744                -.806452
    3.125000                3.285256                 .783292                 .806452
               -5.125000                -.379037                -.830296
   -2.000000                8.031220                 .332033                1.636748
               20.580000                -.015670               -4.092923
   18.580000              -12.564451               -3.745220                5.729671
              -13.916921                -.004671                6.261609
    4.663079                1.347799                2.521060                -.531938
               -4.022497                -.008737               -1.321186
     .640581                5.361559                1.208612                 .789248
              -33.667611                -.001970                -.862761
  -33.027030               39.027201                 .347820                1.652009

We now have $e^{(2)} \to 0$, but $e^{(1)}$ and $e^{(3)}$ do not tend to zero. This is an
indication that there are two pairs of complex conjugate zeros. To make
use of theorem 8.3c, we form the quantities

$A_n^{(k)} = q_{n+1}^{(k+1)} + q_n^{(k+2)}, \qquad B_n^{(k)} = q_n^{(k+1)} q_n^{(k+2)}$

for k = 0 and k = 2, obtaining the following values:


Table 8.5c

    A_n^(0)      B_n^(0)      A_n^(2)      B_n^(2)
   6.410256    26.282051     1.589744     0
   6.031220    25.097561     1.968780     1.282051
   6.015549    25.128902     1.984451     1.902439
   6.010878    25.042114     1.989122     1.992225
   6.002141    25.001372     1.997859     1.989741
   6.000171    25.000110     1.999829     1.996637


The limits are 6, 25, 2, 2 respectively, indicating that the polynomials
$z^2 - 6z + 25$ and $z^2 - 2z + 2$


are quadratic factors of the given polynomial. In fact,


$(z^2 - 6z + 25)(z^2 - 2z + 2) = z^4 - 8z^3 + 39z^2 - 62z + 50$.
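
The quantities of Table 8.5c can be read off the rows of the progressive scheme. In the sketch below, row r of `rows` holds $q_r^{(1)}, q_{r-1}^{(2)}, q_{r-2}^{(3)}, q_{r-3}^{(4)}$, so the sum in (8-17a) is taken within one row, while the product in (8-17b) pairs an entry with its neighbor one row lower. The helper names `qd_rows` and `quadratic_factor` are assumptions of this illustration.

```python
def qd_rows(a, num_rows):
    """q rows of the QD scheme by algorithm 8.5 (all a[k] assumed nonzero)."""
    N = len(a) - 1
    q = [-a[1] / a[0]] + [0.0] * (N - 1)
    e = [0.0] + [a[k + 1] / a[k] for k in range(1, N)] + [0.0]
    rows = [q]
    for _ in range(num_rows - 1):
        q = [e[k] - e[k - 1] + q[k - 1] for k in range(1, N + 1)]
        e = [0.0] + [q[k] / q[k - 1] * e[k] for k in range(1, N)] + [0.0]
        rows.append(q)
    return rows


def quadratic_factor(rows, r, k):
    """Approximations (A, B) to the factor z^2 - A z + B, cf. (8-17)."""
    A = rows[r + 1][k] + rows[r + 1][k + 1]   # q_{n+1}^(k+1) + q_n^(k+2)
    B = rows[r][k] * rows[r + 1][k + 1]       # q_n^(k+1) * q_n^(k+2)
    return A, B


rows = qd_rows([1, -8, 39, -62, 50], 7)       # polynomial of example 7
for r in range(6):
    A0, B0 = quadratic_factor(rows, r, 0)
    A2, B2 = quadratic_factor(rows, r, 2)
    print("%10.6f %10.6f %10.6f %10.6f" % (A0, B0, A2, B2))
```

The individual q columns oscillate wildly, but the combinations A and B settle down quickly, which is the content of theorem 8.3c for m = 2.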


We have yet to deal with the complication that arises if some of the
coefficients $a_0, a_1, \ldots, a_N$ are zero. In that case the extended QD scheme
defined by algorithm 8.5 clearly does not exist, since some of the relations
(8-19) may fail to make sense. A possible remedy is to introduce a new
variable

$z^* = z - a$

and to consider the polynomial

$p^*(z^*) = p(a + z^*) = p(a) + \dfrac{1}{1!}\, p'(a) z^* + \dfrac{1}{2!}\, p''(a) z^{*2} + \cdots + \dfrac{1}{N!}\, p^{(N)}(a) z^{*N}$.


Here a denotes a suitably chosen parameter. The coefficients of the
polynomial p* can easily be calculated by means of algorithm 3.6. It can
be shown that if p has some zero coefficients, then all coefficients of p*
are different from zero for sufficiently small values of $a \neq 0$. If the
zeros z* of p* have been computed, those of p are given by the formula


$z = z^* + a$.


EXAMPLE
8. Let $p(z) = 81z^4 - 108z^3 + 24z + 20$. Here $a_2 = 0$, and algorithm
8.5 cannot be started. We form p* with a = 1. Scheme 3.6 turns out
thus:


   81   -108      0     24     20
   81    -27    -27     -3     17
   81     54     27     24
   81    135    162
   81    216
   81


The new polynomial
$p^*(z^*) = 81z^{*4} + 216z^{*3} + 162z^{*2} + 24z^* + 17$


has all coefficients different from zero, and the first rows of its QD scheme
are as follows:


   -216/81           0            0            0
           162/216      24/162       17/24
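
The change of variable is easy to mechanize. Algorithm 3.6 is not reproduced in this chapter; the sketch below assumes that it amounts to repeated synthetic division by z - a, which is what the rows of the scheme in example 8 suggest. The function name `shift_polynomial` is hypothetical.

```python
def shift_polynomial(a, alpha):
    """Coefficients of p*(z*) = p(alpha + z*), highest power first.

    Uses repeated synthetic division (Horner's scheme) by (z - alpha);
    each pass fixes one more coefficient of p*, starting with the constant term.
    """
    b = list(a)
    degree = len(b) - 1
    for last in range(degree, 0, -1):
        for i in range(1, last + 1):
            b[i] += alpha * b[i - 1]
    return b


print(shift_polynomial([81, -108, 0, 24, 20], 1))   # coefficients of p* in example 8
```

After the first pass the list reads 81, -27, -27, -3, 17, which is the second row of the scheme above; the later passes reproduce the remaining rows.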


Problems
11. Construct the QD scheme for the polynomial
$p(z) = 32z^3 - 48z^2 + 18z - 1$
and determine its zeros to four significant digits. Check your result by
observing that z = 1/2 is a zero. (The convergence of the q columns may
be sped up by Aitken’s $\Delta^2$-process.)
12. Determine approximate values for the zeros of the polynomial
$p(z) = 70z^4 - 140z^3 + 90z^2 - 20z + 1$.


Then find more exact values by Newton’s method.

13. The polynomial

$p(z) = z^5 - 3z^4 - 20z^3 + 60z^2 - z - 78$

has two large real zeros of opposite sign, a pair of complex conjugate
zeros, and a small real zero. Find approximate values for the quadratic
factors belonging to the two large real zeros and to the pair of complex
zeros.

14. Prove theorem 8.5 for polynomials of degree N = 2. (Hint: It suffices to 
show that algorithm 8.5 generates the correct values of q§”, e?, and g{”.] 


8.6 Computational Checks 


Even if the QD scheme is generated by algorithm 8.5, excessively large
(or small) elements may cause some loss of accuracy. The mathematical
results given below may be used for checking purposes; failure of any of
these checks indicates excessive rounding error.

(i) Relation (8-5a) implies that the sum of the q values in any row of the
scheme is constant. For the top row this sum is $-a_1/a_0$, which by Vieta’s
formula (see §2.5) equals the algebraic sum of the zeros of the polynomial.
Thus we have


(8-21)    $q_n^{(1)} + q_{n-1}^{(2)} + \cdots + q_{n-N+1}^{(N)} = -\dfrac{a_1}{a_0}$.

Note that by theorem 8.3a this relation confirms Vieta’s rule for $n \to \infty$!
(ii) It can be shown that the product of all q elements in any diagonal
sloping downward is likewise independent of n and equals the product of
all zeros of the polynomial p. Thus again by Vieta,


(8-22)    $q_n^{(1)} q_n^{(2)} \cdots q_n^{(N)} = (-1)^N \dfrac{a_N}{a_0}$.


It should be noted that while (8-21) checks only the additions and sub-
tractions performed in constructing the scheme, equation (8-22) checks all
operations.

(iii) If the QD scheme is generated by algorithm 8.5, the quantities $x_n$
are not needed. However, we may calculate the $x_n$ from (8-2) and should
find


(8-23)    $\dfrac{x_{n+1}}{x_n} = q_n^{(1)}, \qquad n = 0, 1, 2, \ldots$.
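
Checks (i) and (ii) are cheap to apply to a computed scheme. The sketch below does so for the polynomial of example 6; it assumes the row layout used by algorithm 8.5 (row r holds $q_r^{(1)}, q_{r-1}^{(2)}, \ldots$), so that a downward-sloping diagonal picks one entry from each of N consecutive rows. The helper name `qd_rows` is hypothetical.

```python
import math


def qd_rows(a, num_rows):
    """q rows of the QD scheme by algorithm 8.5 (all a[k] assumed nonzero)."""
    N = len(a) - 1
    q = [-a[1] / a[0]] + [0.0] * (N - 1)
    e = [0.0] + [a[k + 1] / a[k] for k in range(1, N)] + [0.0]
    rows = [q]
    for _ in range(num_rows - 1):
        q = [e[k] - e[k - 1] + q[k - 1] for k in range(1, N + 1)]
        e = [0.0] + [q[k] / q[k - 1] * e[k] for k in range(1, N)] + [0.0]
        rows.append(q)
    return rows


a = [128.0, -256.0, 160.0, -32.0, 1.0]
N = len(a) - 1
rows = qd_rows(a, 12)

# (8-21): every row of q's sums to -a_1/a_0
row_sums = [sum(row) for row in rows]

# (8-22): q_n^(k) sits in row n + k - 1, column k (0-based: rows[n + k][k])
diagonal_products = [math.prod(rows[n + k][k] for k in range(N))
                     for n in range(len(rows) - N + 1)]

print(row_sums, -a[1] / a[0])
print(diagonal_products, (-1) ** N * a[N] / a[0])
```

A drift of either quantity away from its theoretical value would indicate that rounding errors have begun to contaminate the scheme.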


Problem 


15. Prove (8-22) for the QD scheme arising from a polynomial of degree 
N = 2. 


8.7. QD versus Newton 


In comparison with other methods for determining the zeros of a 
polynomial, the QD algorithm enjoys the tremendous advantage of 
furnishing simultaneously approximations to all zeros of a polynomial. 
No information about the polynomial other than the values of its 
coefficients is required. 

These advantages have to be paid for by the rather slow convergence of 
the algorithm. Since the QD method contains the Bernoulli method as a 
special case, the convergence can be no better than that of Bernoulli’s 
method. In fact it can be shown that under the hypotheses of theorem
8.3b the errors $q_n^{(k)} - z_k$ tend to zero like the larger of the ratios


$\left(\dfrac{z_k}{z_{k-1}}\right)^n$ and $\left(\dfrac{z_{k+1}}{z_k}\right)^n$.


Even if the figures in the g columns eventually settle down, the accuracy 
of the zeros thus obtained is somewhat uncertain, because the large 
number of arithmetic operations may have contaminated the scheme with 
rounding error. 

For the above reasons, the QD algorithm is not recommended for the 
purpose of determining the zeros of a polynomial with final accuracy. 
Instead, the following two-stage procedure is advocated: 

Stage 1: Use the QD algorithm to obtain crude first approximations to 
the zeros, respectively to the quadratic factors containing complex 
conjugate zeros. 

Stage 2: Using these approximations as starting values, obtain the zeros 
accurately by Newton’s or Bairstow’s method. 

This combination of several methods has the advantage that the final 
values of the zeros are obtained from the original, undisturbed polynomial, 
and thus are practically free of rounding error. 

The choice of the point at which to make the change-over from QD to 
Newton-Bairstow is, to some extent, arbitrary. It is probably best to 
carry the QD scheme to a point where the division of the scheme into 
subschemes in the manner described after theorem 8.3b is clearly evident. 
On the other hand, if QD is pushed too far, a lot of computational effort 
may be wasted, since Newton-Bairstow usually takes only two or three 
steps to obtain the zeros very accurately even from mediocre first approxi- 
mations. To fuse the three algorithms into one working program 
constitutes a challenging but rewarding problem in machine programming 
which is highly recommended to the reader. One such program is 
described by Watkins [1964], who also presents the results of extensive 
machine tests. 
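
Stage 2 can be illustrated for real, simple zeros with a few lines of Newton’s method. The sketch below starts from the crude values in the last row of table 8.5a and always works with the original polynomial; the function name `newton_polish` and the fixed number of steps are assumptions of this sketch, and Bairstow’s method for quadratic factors is not shown.

```python
def newton_polish(a, z, steps=6):
    """Refine an approximate zero z of p(z) = a[0] z^N + ... + a[N] by Newton's method."""
    for _ in range(steps):
        p, dp = a[0], 0.0
        for c in a[1:]:
            dp = dp * z + p      # derivative p'(z), accumulated by Horner's scheme
            p = p * z + c        # value p(z)
        z -= p / dp
    return z


a = [128.0, -256.0, 160.0, -32.0, 1.0]                 # polynomial of example 6
crude = [0.974940, 0.678643, 0.308356, 0.038060]       # last q row of table 8.5a
refined = [newton_polish(a, z) for z in crude]
print(refined)
```

Because each zero is refined on the undeflated polynomial, errors in one zero do not propagate into the others.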





Problem 


16. By combining QD and Newton-Bairstow, compute all zeros of the
following polynomials with an error of less than $10^{-7}$:


(a) $p(z) = z^4 - 8z^3 + 39z^2 - 62z + 51$;
(b) $p(z) = z^5 - 15z^4 + 85z^3 - 225z^2 + 264z - 120$;
(c) $p(z) = 4z^6 - 5z^5 + 4z^4 - 3z^3 + 7z^2 - 7z + 1$.


8.8 Other Applications 
In addition to the calculation of the zeros of a polynomial, the QD
algorithm has many other applications, notably in the theory of continued
fractions, in matrix computation, and in the summation of divergent series
(see the references given below). It can also be used to furnish exact
bounds (and not only approximations) for the location of the zeros of a
polynomial. We shall mention only one further application which is
related to the one discussed above.

Suppose the function f is defined by the power series


(8-24)    $f(z) = \sum_{n=0}^{\infty} a_n z^n$,

where $a_n \neq 0$, n = 0, 1, 2, ..., and let it be known that the zeros $z_k$ of f are
real and positive, $0 < z_1 < z_2 < \cdots$.


EXAMPLE
9. The Bessel function of order zero can be defined by


(8-25)    $J_0(2\sqrt{x}) = \sum_{n=0}^{\infty} \dfrac{(-1)^n x^n}{(n!)^2}$

and has the required properties.


Obviously the QD scheme of such a function cannot be found in the
ordinary way, because the horizontal rows are now infinite. However,
the scheme may be generated in the following manner, each element being
computed from the elements above it:


   -a_1/a_0            0               0
            a_2/a_1         a_3/a_2         a_4/a_3
   q_1^(1)         q_0^(2)         q_{-1}^(3)
            e_1^(1)         e_0^(2)
   q_2^(1)         q_1^(2)
            e_2^(1)
   q_3^(1)

                     Scheme 8.8


Although the scheme does not terminate on the right, more and more
diagonals sloping upward can be found. It can be shown that if the
scheme exists,

$\lim_{n\to\infty} q_n^{(k)} = \dfrac{1}{z_k}, \qquad k = 1, 2, \ldots$.

Under certain conditions even complex conjugate zeros of transcendental
functions can be found in this manner, using the method described in
theorem 8.3c and in example 7.


EXAMPLE
10. For the function defined by (8-25) the coefficients $a_n$ evidently satisfy


$\dfrac{a_{n+1}}{a_n} = -\dfrac{1}{(n+1)^2}$.


The first few diagonals of the QD scheme thus appear as follows:


  1.000000           .000000           .000000           .000000           .000000
           -.250000          -.111111          -.062500          -.040000
   .750000           .138889           .048611           .022500           .012222
           -.046296          -.038889          -.028929          -.021728
   .703704           .146296           .058571           .029700
           -.009625          -.015570          -.014669
   .694079           .140351           .059472
           -.001946          -.006597
   .692133           .135700
           -.000382
   .691751
      ↓                 ↓
   .691660           .131271
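
The triangular computation of example 10 can be sketched as follows. Only the coefficient ratios $a_{k+1}/a_k$ are needed; each new row of q’s uses one ratio fewer on the right, so a finite list of ratios yields a finite number of entries in the first q column. The function name `qd_first_column` and its argument layout are assumptions of this sketch.

```python
def qd_first_column(ratios):
    """First q column of the QD scheme of a power series.

    ratios[k] = a_{k+1} / a_k for k = 0, 1, ..., m (all assumed nonzero).
    Returns q_0^(1), q_1^(1), ..., q_m^(1).
    """
    m = len(ratios) - 1
    q = [-ratios[0]] + [0.0] * (m - 1)     # top row: -a_1/a_0, 0, 0, ...
    e = list(ratios[1:])                   # a_2/a_1, a_3/a_2, ...
    first_column = [q[0]]
    while e:
        # rule (8-18a), with e^(0) = 0 on the left edge
        q = [e[k] - (e[k - 1] if k > 0 else 0.0) + q[k] for k in range(len(e))]
        # rule (8-18b); the rightmost e cannot be continued and is dropped
        e = [q[k + 1] / q[k] * e[k] for k in range(len(e) - 1)]
        first_column.append(q[0])
    return first_column


# Bessel example: a_{n+1}/a_n = -1/(n+1)^2
column = qd_first_column([-1.0 / (n + 1) ** 2 for n in range(6)])
print(["%.6f" % v for v in column])
```

Supplying more ratios extends the column further toward its limit $1/z_1$.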
Problems 


17. Obtain an approximate value of V 2/7 by applying the algorithm described 
above to the function 
$\cos\sqrt{z} = \sum_{n=0}^{\infty} \dfrac{(-1)^n z^n}{(2n)!}$.
18. Apply the above version of the QD algorithm to the problem of finding 
the small solutions of the transcendental equation 


$\tan z = cz$
by setting
$f(z^2) = \dfrac{\sin z}{z} - c \cos z$.


Find the smallest positive solution for c = 1.2, and a pair of purely 
imaginary solutions for c = 0.8. 
Recommended Reading


The QD algorithm was introduced by Rutishauser in a series of classical 
papers which are collected in the volume Rutishauser [1956]. A somewhat 
more elementary treatment is given by Henrici [1958]. A multitude of 
applications are discussed in Henrici [1963]. 


Research Problem 


How can the argument of a complex zero of a real polynomial be 
determined from the signs of the elements of the corresponding q column?
Consider the case N = 2 first. 


INTERPOLATION


AND 
chapter 9 the interpolating polynomial


So far we have been concerned mainly with the problem of approximating 
numbers (such as the zeros of a polynomial or the solutions of systems of 
nonlinear equations). We now turn to the problem of approximating 
functions and, more generally, numbers, such as derivatives and integrals, 
that depend on an infinity of values of a function. 

The most common method of approximating functions is the approxi- 
mation by polynomials. Among the various types of polynomial approxi- 
mation that are in use the one that is most flexible and most easily 
constructed (although not always the most effective) is the approxima- 
tion by the interpolating polynomial. 


9.1 Existence of the Interpolating Polynomial 


Let the real function f be defined on an interval I, and let $x_0, x_1, \ldots, x_n$
be n + 1 distinct points of I. It is not assumed that these points are
equidistant, nor even that they are in their natural order.


We shall write for brevity
$f(x_k) = f_k, \qquad k = 0, 1, \ldots, n$.


Theorem 9.1 There exists a unique polynomial P of degree not 
exceeding n (the so-called Lagrangian interpolating polynomial) such 
that 


(9-1)    $P(x_k) = f_k, \qquad k = 0, 1, \ldots, n$.


Proof. As usual, the proofs of existence and of uniqueness require 
separate arguments. The existence of the polynomial P is proved if we 
183 
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can establish the existence of polynomials L, (k = 0,1,...,”) with the 
following properties: 


(i) Each L, is a polynomial of degree Sn; 
(ii) For x = x,, L, has the special value 


(9-2a) L(x) = 1; 
however, if m 4 k, then 
(9-2b) Li{Xm) = 9. 


Assuming the existence of these L,, we can set 


(9-3) P(x) = > FL 


The function P is a sum of polynomials of degree <n with constant factors 
and thus itself a polynomial of degree <n. Furthermore, if we set 
x = X,, then by (if) all L, are zero except for the one with k = m, and this 
has the value 1. Thus we find 


P(Xm) = Sms 
as required. 
The polynomials $L_k$ are called the Lagrangian interpolation coefficients.
To prove the existence of $L_k$, observe that the product

$$\prod_{\substack{m=0 \\ m \ne k}}^{n} \frac{x - x_m}{x_k - x_m} = \frac{(x - x_0) \cdots (x - x_{k-1})(x - x_{k+1}) \cdots (x - x_n)}{(x_k - x_0) \cdots (x_k - x_{k-1})(x_k - x_{k+1}) \cdots (x_k - x_n)}$$

has the required properties. Indeed, as a product of $n + 1 - 1 = n$
linear factors it represents a polynomial of degree $n$; furthermore, if
$x = x_k$, then all factors have the value 1, thus the product has the value 1
also. On the other hand, if $x = x_m$ where $m \ne k$, then the factor
containing $x - x_m$ is zero, and the product vanishes. Thus, the polynomials

$$L_k(x) = \prod_{\substack{m=0 \\ m \ne k}}^{n} \frac{x - x_m}{x_k - x_m}, \qquad k = 0, 1, \ldots, n, \tag{9-4}$$

have the required properties, and the existence of the polynomial $P$ is
proved.

In order to show the uniqueness of the interpolating polynomial, assume
there exist two interpolating polynomials, $P$ and $Q$, say. Then their
difference $D = P - Q$, being the difference of two polynomials of degree
not exceeding $n$, is again a polynomial of degree not exceeding $n$.
Moreover,

$$D(x_k) = P(x_k) - Q(x_k) = f_k - f_k = 0$$

for $k = 0, 1, \ldots, n$. The polynomial $D$ thus has $n + 1$ zeros and hence,
being a polynomial of degree $\le n$, must vanish identically. It follows that
$P = Q$. This completes the proof of theorem 9.1.


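EXAMPLE

1. To find the interpolating polynomial for the following $x_k$ and $f_k$:

    $x_k$     2     3    -1     4
    $f_k$     1     2     3     4


We first calculate the Lagrangian interpolation coefficients. Formula
(9-4) yields

$$L_0(x) = \frac{(x - 3)(x + 1)(x - 4)}{(-1) \cdot 3 \cdot (-2)} = \tfrac{1}{6}(x - 3)(x + 1)(x - 4),$$
$$L_1(x) = \frac{(x - 2)(x + 1)(x - 4)}{1 \cdot 4 \cdot (-1)} = -\tfrac{1}{4}(x - 2)(x + 1)(x - 4),$$
$$L_2(x) = \frac{(x - 2)(x - 3)(x - 4)}{(-3)(-4)(-5)} = -\tfrac{1}{60}(x - 2)(x - 3)(x - 4),$$
$$L_3(x) = \frac{(x - 2)(x - 3)(x + 1)}{2 \cdot 1 \cdot 5} = \tfrac{1}{10}(x - 2)(x - 3)(x + 1).$$


Formula (9-3) thus yields

$$P(x) = \tfrac{1}{6}(x - 3)(x + 1)(x - 4) - \tfrac{1}{2}(x - 2)(x + 1)(x - 4) - \tfrac{1}{20}(x - 2)(x - 3)(x - 4) + \tfrac{2}{5}(x - 2)(x - 3)(x + 1).$$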


It can be verified that this polynomial has the required properties. 
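
The verification mentioned in example 1 can be carried out mechanically. The following is a minimal sketch, not part of the original text, that evaluates (9-3) and (9-4) in exact rational arithmetic; the function names `lagrange_coefficient`, `interpolate`, and `p_example` are chosen here for illustration only.

```python
from fractions import Fraction


def lagrange_coefficient(xs, k, x):
    """Value at x of the coefficient L_k of formula (9-4) for the points xs."""
    value = Fraction(1)
    for m, xm in enumerate(xs):
        if m != k:
            value *= (Fraction(x) - xm) / (xs[k] - xm)
    return value


def interpolate(xs, fs, x):
    """Value at x of the interpolating polynomial P of formula (9-3)."""
    return sum(fk * lagrange_coefficient(xs, k, x) for k, fk in enumerate(fs))


def p_example(x):
    """The explicit representation of P obtained in example 1."""
    x = Fraction(x)
    return (Fraction(1, 6) * (x - 3) * (x + 1) * (x - 4)
            - Fraction(1, 2) * (x - 2) * (x + 1) * (x - 4)
            - Fraction(1, 20) * (x - 2) * (x - 3) * (x - 4)
            + Fraction(2, 5) * (x - 2) * (x - 3) * (x + 1))


xs = [2, 3, -1, 4]   # interpolating points of example 1
fs = [1, 2, 3, 4]    # prescribed values f_k

# P reproduces the prescribed value at every interpolating point.
values_at_nodes = [interpolate(xs, fs, xk) for xk in xs]
print(values_at_nodes == fs)
```

Because the arithmetic is exact, agreement of `interpolate` and `p_example` at more than four points already shows that both describe the same cubic.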


It will be noted that the representation of the interpolating polynomial 
given in the proof of theorem 9.1 does not give the polynomial in the 
customary form 

$$P(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_n.$$


Of course, the polynomial could be put into the above standard form, but 
there usually is no particular reason for doing so. It is well to distinguish 
at this point between the function P and the various representations of P. 
As a function (i.e., as a set of ordered pairs (x, P(x))), P is unique. How- 
ever, there may be many ways of representing P by an explicit formula. 
Each formula suggests a certain algorithm for calculating P. It is not 
claimed that the algorithm suggested by (9-3) is the most effective from 
the numerical point of view. Many other algorithms for constructing 
the polynomial will be discussed in the chapters 10 and 11. 


Problems


1. Verify that the case $n = 1$ of (9-3) yields the familiar formula for linear
interpolation. What is the meaning of theorem 9.1 when $n = 0$?
2. Is the interpolating polynomial constructed above always of the exact
degree $n$?
3. Show that for $x \ne x_k$, $k = 0, 1, \ldots, n$, the interpolating polynomial can
be represented in the form

$$P(x) = L(x) \sum_{k=0}^{n} \frac{f_k}{(x - x_k) L'(x_k)},$$

where

$$L(x) = (x - x_0)(x - x_1) \cdots (x - x_n).$$

Verify by L’Hopital’s rule (see Taylor [1959], p. 456) that the limit of the
expression on the right as $x \to x_m$ is $f_m$.
4. Prove: If $f$ is a polynomial of degree $n$ or less, then $P = f$.


9.2 The Error of the Interpolating Polynomial


Since we wish to use the interpolating polynomial to approximate the
function $f$ at points which do not belong to the set of interpolating points
$x_k$, we are interested in estimating the difference $P(x) - f(x)$ for $x \in I$.
It is clear that without further hypotheses nothing whatever can be said
about this quantity. For we can change the function $f$ at will at points
which are not interpolating points without changing the polynomial $P$
at all (see Fig. 9.2).

A definite statement can be made, however, if we assume a qualitative
knowledge of the derivatives of the function $f$.





Figure 9.2

Theorem 9.2 In addition to the hypotheses of theorem 9.1, let $f$ be
$n + 1$ times continuously differentiable on the interval $I$. Then to
each $x \in I$ there exists a point $\xi_x$ located in the smallest interval
containing the points $x, x_0, x_1, \ldots, x_n$ such that

$$f(x) - P(x) = \frac{1}{(n + 1)!} L(x) f^{(n+1)}(\xi_x), \tag{9-5}$$

where

$$L(x) = (x - x_0)(x - x_1) \cdots (x - x_n).$$


Proof. If $x$ is one of the points $x_k$, there is nothing to prove, since both
sides of (9-5) vanish for arbitrary $\xi_x$. If $x$ has a fixed value different from
any of the points $x_k$, consider the auxiliary function $F = F(t)$ defined by

$$F(t) = f(t) - P(t) - cL(t), \tag{9-6}$$

where

$$c = \frac{f(x) - P(x)}{L(x)}.$$

We have

$$F(x_k) = f(x_k) - P(x_k) - cL(x_k) = 0, \qquad k = 0, 1, 2, \ldots, n,$$

and also

$$F(x) = f(x) - P(x) - cL(x) = 0,$$

by the definition of $c$. The function $F$ thus has at least $n + 2$ distinct
zeros in the interval $I$. By Rolle’s theorem, the derivative $F'$ must have
at least $n + 1$ zeros in the smallest interval containing $x$ and the $x_k$, the
second derivative must have no less than $n$ zeros, and finally the $(n + 1)$st
derivative must have at least one zero. Let $\xi_x$ be one such zero. We now
differentiate (9-6) $n + 1$ times and set $t = \xi_x$. The $(n + 1)$st derivative of
$P$ is zero. Since $L$ is a polynomial with leading term $x^{n+1}$, the $(n + 1)$st
derivative of $cL$ is $c(n + 1)!$. We thus have

$$0 = F^{(n+1)}(\xi_x) = f^{(n+1)}(\xi_x) - c(n + 1)!$$

or, remembering the definition of $c$ and rearranging,

$$cL(x) = f(x) - P(x) = \frac{1}{(n + 1)!} L(x) f^{(n+1)}(\xi_x),$$

as was to be shown.

Equation (9-5) cannot be used, of course, to calculate the exact value of
the error $f - P$, since $\xi_x$ as a function of $x$ is, in general, not known.
(An exception occurs when the $(n + 1)$st derivative of $f$ is constant; see
below.) However, as is shown in the examples below, the formula can
be used in many cases to find a bound for the error of the interpolating
polynomial.
We also shall require the following fact.


Corollary 9.2. Under the hypotheses of theorem 9.2, the quantity
$f^{(n+1)}(\xi_x)$ in (9-5) can be defined as a continuous function of $x$ for
$x \in I$.


Proof. Define the function $g = g(x)$ by

$$g(x) = \begin{cases} \dfrac{f(x) - P(x)}{L(x)}, & x \ne x_k, \\[2ex] \dfrac{f'(x_k) - P'(x_k)}{L'(x_k)}, & x = x_k. \end{cases}$$


This function is continuous for $x \ne x_k$ and, by L’Hopital’s rule (see Taylor
[1959], p. 456) also at the points $x_k$. For $x \ne x_k$,

$$f^{(n+1)}(\xi_x) = (n + 1)!\, g(x),$$

establishing the corollary. An application of this corollary will be made
in the chapters 12 and 13.


EXAMPLES
2. One interpolating point. If there is only one interpolating point $x_0$,
the interpolating polynomial reduces to the constant $f_0$. Formula (9-5)
yields

$$f(x) - f(x_0) = (x - x_0) f'(\xi_x),$$

where $\xi_x$ lies between $x_0$ and $x$. This is the familiar mean value theorem
of the differential calculus.
3. Two interpolating points. The linear interpolating polynomial is
given by

$$P(x) = \frac{(x_1 - x) f_0 + (x - x_0) f_1}{x_1 - x_0}.$$


Equation (9-5) yields the error formula

$$f(x) - P(x) = \frac{(x - x_0)(x - x_1)}{2} f''(\xi_x).$$


What is the maximum error that can occur if we know that $|f''(x)| \le M_2$,
and $x$ is between $x_0$ and $x_1$? The maximum of the function

$$\left|\tfrac{1}{2}(x - x_0)(x - x_1)\right|$$

between $x_0$ and $x_1$ occurs at $x = \tfrac{1}{2}(x_0 + x_1)$ and has the value $\tfrac{1}{8}(x_1 - x_0)^2$.
Thus we find

$$|f(x) - P(x)| \le \frac{(x_1 - x_0)^2}{8} M_2$$

in this case. Application: If we calculate the value of $\sin x$ from a sine
table with step $h$, using linear interpolation, the error is bounded by
$\tfrac{1}{8}h^2$, since $M_2 = 1$ in this case.
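
The bound of example 3 is easily checked by experiment. The sketch below is an illustration added here under stated assumptions: the step `h = 0.01`, the range $[0, \pi/2]$, and the function name `linear_interpolation_error` are not taken from the text. It scans a hypothetical sine table and compares the largest observed error of linear interpolation with $h^2/8$.

```python
import math


def linear_interpolation_error(f, x0, x1, samples=200):
    """Largest sampled |f(x) - P(x)| on [x0, x1], P being the linear interpolant."""
    f0, f1 = f(x0), f(x1)
    worst = 0.0
    for i in range(samples + 1):
        x = x0 + (x1 - x0) * i / samples
        p = ((x1 - x) * f0 + (x - x0) * f1) / (x1 - x0)
        worst = max(worst, abs(f(x) - p))
    return worst


h = 0.01                           # assumed table step
bound = h * h / 8                  # bound of example 3 with M_2 = 1
intervals = int(math.pi / 2 / h)   # table intervals inside [0, pi/2]
observed = max(linear_interpolation_error(math.sin, k * h, (k + 1) * h)
               for k in range(intervals))
print(observed, bound)
```

The observed maximum stays below the bound and comes close to it near $x = \pi/2$, where $|\sin x|$ is nearly 1.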

4. Error in cubic interpolation. We assume that the four interpolating
points are equidistant, $x_k = x_0 + kh$, $k = 1, 2, 3$, and that $x$ (the point
where the value of the function $f$ is sought) always lies between $x_1$ and $x_2$.
We set

$$M_4 = \max_{x_0 \le x \le x_3} |f^{(4)}(x)|.$$


The interpolation error is then bounded by $(1/4!)M_4$ times the maximum
of the absolute value of the function

$$L(x) = (x - x_0)(x - x_1)(x - x_2)(x - x_3)$$

in the interval $x_1 \le x \le x_2$. For reasons of symmetry this maximum
occurs at $x = (x_1 + x_2)/2$ and has the value

$$\left(\tfrac{3}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{3}{2}\right) h^4 = \tfrac{9}{16} h^4.$$


It follows that the interpolation error is bounded by

$$\tfrac{3}{128} h^4 M_4.$$


In a sine table, for instance, using cubic interpolation and a step as large
as $h = 0.1$ we get a maximum error of less than $2.5 \times 10^{-6}$.
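
The figure quoted for the sine table can be reproduced numerically. The following sketch is an added illustration, not the author's computation; the table range and the names `cubic_interpolant` and `observed` are assumptions. It interpolates $\sin x$ at four consecutive entries of a table with $h = 0.1$ and records the largest error on the middle interval.

```python
import math


def cubic_interpolant(f, x0, h, x):
    """Value at x of the cubic interpolating f at x0, x0 + h, x0 + 2h, x0 + 3h."""
    nodes = [x0 + k * h for k in range(4)]
    total = 0.0
    for k, xk in enumerate(nodes):
        term = f(xk)
        for m, xm in enumerate(nodes):
            if m != k:
                term *= (x - xm) / (xk - xm)
        total += term
    return total


h = 0.1
bound = 3 * h ** 4 / 128       # bound of example 4 with M_4 = 1
observed = 0.0
for k in range(16):            # assumed table entries x0 = 0.0, 0.1, ..., 1.5
    x0 = k * h
    for i in range(201):       # sample the middle interval [x1, x2]
        x = x0 + h + h * i / 200
        error = abs(math.sin(x) - cubic_interpolant(math.sin, x0, h, x))
        observed = max(observed, error)
print(observed, bound)
```

With these assumptions the bound evaluates to about $2.34 \times 10^{-6}$, in agreement with the statement that the error is less than $2.5 \times 10^{-6}$.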


The basic error formula (9-5) requires the knowledge of a bound of the
derivatives of the function $f$. Such bounds can often be obtained very
easily even for non-elementary functions by exploiting known functional 
relations. 


EXAMPLE
5. The Bessel function of order zero can be defined by

$$J_0(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin t)\, dt.$$

By differentiating under the integral sign,

$$J_0'(x) = -\frac{1}{\pi} \int_0^{\pi} \sin t \, \sin(x \sin t)\, dt,$$

$$J_0''(x) = -\frac{1}{\pi} \int_0^{\pi} (\sin t)^2 \cos(x \sin t)\, dt,$$

etc. The integrands of all integrals which we obtain by differentiation are
bounded in absolute value by 1. Thus

$$|J_0^{(k)}(x)| \le \frac{1}{\pi} \int_0^{\pi} dt = 1, \qquad k = 0, 1, 2, \ldots.$$
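
The integral representation also lends itself to a direct numerical check of these bounds. The sketch below is an added illustration; the trapezoidal rule, the number of panels, and the name `bessel_j0_derivative` are choices made here, not taken from the text. It evaluates $J_0$, $J_0'$ and $J_0''$ from the three integrals above.

```python
import math


def bessel_j0_derivative(order, x, panels=400):
    """Derivative of the given order (0, 1 or 2) of J_0 at x, computed from the
    integral representation by the composite trapezoidal rule on [0, pi]."""
    def integrand(t):
        if order == 0:
            return math.cos(x * math.sin(t))
        if order == 1:
            return -math.sin(t) * math.sin(x * math.sin(t))
        if order == 2:
            return -math.sin(t) ** 2 * math.cos(x * math.sin(t))
        raise ValueError("only orders 0, 1, 2 are implemented in this sketch")

    step = math.pi / panels
    total = 0.5 * (integrand(0.0) + integrand(math.pi))
    for j in range(1, panels):
        total += integrand(j * step)
    return total * step / math.pi


# Largest sampled absolute value of each derivative on 0 <= x <= 20.
largest = [max(abs(bessel_j0_derivative(order, 0.1 * i)) for i in range(201))
           for order in range(3)]
print(largest)
```

Each computed value is a weighted mean of integrand values that do not exceed 1 in absolute value, so none of the three maxima can exceed 1.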
Problems


5. A table of a function of one variable is well suited for linear interpolation
if the error due to interpolation does not exceed the rounding error of the
entries. What is the greatest permissible step of such a “well
interpolable” table of $\cos x$ as a function of the number of decimal places,
(a) if $x$ is given in radians; (b) if $x$ is given in degrees? Make a survey of
some tables accessible to you and decide whether they are well suited for
linear interpolation.

6. What is the maximum value of the combined error due to linear
interpolation and rounding of the formula

$$f(x) \approx \frac{x - x_0}{x_1 - x_0} f_1 + \frac{x_1 - x}{x_1 - x_0} f_0$$

if $f_1$ and $f_0$ are known to $N$ places, and if the products are rounded to $N$
places? (It may be assumed that the fractions $(x - x_0)/(x_1 - x_0)$ and
$(x_1 - x)/(x_1 - x_0)$ are exact decimal fractions.)

7. The function $\log_{10}(\sin x)$, where $x$ is given in degrees, is tabulated to five
decimal places with a step of 1/60 of one degree. From what value of $x$ on
is this table well suited for linear interpolation?

8. The function $f(x) = \sqrt{x}$ is tabulated at the integers, $x = 1, 2, 3, \ldots$,
giving four decimals. From what $x$ on is this table well suited for linear
interpolation?

9. The Bessel function of order $n$ can be defined by

$$J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin t - nt)\, dt.$$

How do we have to choose the step $h$ of a table of $J_n$ so that the error is
less than 10-° 

(a) if linear interpolation is to be used?

(b) if cubic interpolation (as described in example 4) is to be used?

10. Interpolation near the end of a table. The fourth derivative of a function $f$
is known to be bounded by $M_4$. Let $P$ be the polynomial interpolating
$f$ at the points $x_k = kh$, $k = 0, 1, 2, 3$. Give the best possible bound for

$$\max_{x_0 \le x \le x_1} |f(x) - P(x)|.$$

11. Theorem 9.2 implies that if $|f^{(n+1)}(x)| \le M_{n+1}$, $x \in I$, then, in the
notation of theorem 9.2

$$|f(x) - P(x)| \le \frac{M_{n+1}}{(n + 1)!} |L(x)|, \qquad x \in I.$$

Are there any functions $f$ (and corresponding points in $I$) for which this
inequality becomes an equality?

12. Let $n$ be a positive integer. Somebody proposes to calculate the value of
$e^{n+1}$ by constructing the polynomial $P$ interpolating the function $f(x) = e^x$
at the points $x = 0, 1, \ldots, n$ and evaluating $P$ for $x = n + 1$.

(a) Indicate a lower bound for the error $e^{n+1} - P(n + 1)$.

(b) Determine the number $\xi_x$ of theorem 9.2, and thus obtain an exact
expression for the error.

13. The function $f$ is defined on $[0, 1]$ and is known to have a bounded second
derivative. Its values are to be computed from a fixed interpolating
polynomial using two interpolating points $x_0$ and $x_1$. How should one
place the points $x_0$ and $x_1$ in the interval $[0, 1]$ in order to minimize the
error due to interpolation?


9.3 Convergence of Sequences of Interpolating Polynomials 


Let the function $f$ be defined for $-\infty < x < \infty$, and let it and all its
derivatives be bounded by one and the same constant,

$$|f^{(n)}(x)| \le M, \qquad n = 0, 1, \ldots; \quad -\infty < x < \infty.$$


Assume one wishes to calculate $f(x)$ in the interval $[0, h]$ by means of the
interpolating polynomial $P$ of degree $2n - 1$ using the interpolating points

$$\begin{aligned} x_0 &= 0, & x_1 &= h, \\ x_2 &= -h, & x_3 &= 2h, \\ &\;\;\vdots & &\;\;\vdots \\ x_{2n-2} &= -(n - 1)h, & x_{2n-1} &= nh. \end{aligned}$$


For what values of $h$ does the error tend to zero as $n \to \infty$?

Obviously, the above procedure cannot be effective for unrestricted
values of $h$, as the example $f(x) = \sin x$, $h = \pi$ shows. (All interpolating
polynomials are zero in this case.) By theorem 9.2, the interpolation
error of $P$ is bounded by $M|L(x)|/(2n)!$, where

$$L(x) = [x + (n - 1)h][x + (n - 2)h] \cdots [x - nh].$$


The maximum of the function $|L(x)|$ on the interval $[0, h]$ occurs at the
point $x = h/2$. At this point,


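$$\begin{aligned} \frac{1}{(2n)!} \left| L\!\left(\frac{h}{2}\right) \right| &= \frac{\left[(n - \tfrac{1}{2})(n - \tfrac{3}{2}) \cdots \tfrac{3}{2} \cdot \tfrac{1}{2}\right]^2 h^{2n}}{(2n)!} \\ &= \frac{[(2n - 1)(2n - 3) \cdots 3 \cdot 1]^2 h^{2n}}{2^{2n} (2n)!} \\ &= \left[\frac{(2n)!}{2^n n!}\right]^2 \frac{h^{2n}}{2^{2n} (2n)!} \\ &= \frac{(2n)!}{2^{4n} (n!)^2} h^{2n}. \end{aligned}$$

Using Stirling’s asymptotic formula for $n!$,

$$n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$$

(see Buck [1956], p. 159) we find

$$\frac{(2n)!}{2^{4n} (n!)^2} h^{2n} \sim \frac{\sqrt{4\pi n} \left(\dfrac{2n}{e}\right)^{2n}}{2^{4n} \cdot 2\pi n \left(\dfrac{n}{e}\right)^{2n}} h^{2n} = \frac{1}{\sqrt{\pi n}} \left(\frac{h}{2}\right)^{2n}.$$


The last expression tends to zero as $n \to \infty$ if and only if $|h| \le 2$. Thus the
convergence of the interpolation process described above can be guaranteed
only if $|h| \le 2$.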
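
The estimate just derived can be compared with the actual interpolation error. The sketch below is an added illustration under stated assumptions: the choice $f(x) = \sin x$ with $M = 1$, the step $h = 1$, and the names `interpolate`, `max_error_on_0_h` and `error_bound` are introduced here. It computes the largest sampled error on $[0, h]$ and the bound $M (2n)!\, h^{2n} / (2^{4n} (n!)^2)$.

```python
import math


def interpolate(nodes, values, x):
    """Value at x of the interpolating polynomial in the Lagrangian form (9-3)."""
    total = 0.0
    for k, xk in enumerate(nodes):
        term = values[k]
        for m, xm in enumerate(nodes):
            if m != k:
                term *= (x - xm) / (xk - xm)
        total += term
    return total


def max_error_on_0_h(f, h, n, samples=200):
    """Largest sampled error on [0, h] when f is interpolated at the 2n points
    -(n - 1)h, ..., -h, 0, h, ..., nh."""
    nodes = [j * h for j in range(-(n - 1), n + 1)]
    values = [f(xk) for xk in nodes]
    return max(abs(f(h * i / samples) - interpolate(nodes, values, h * i / samples))
               for i in range(samples + 1))


def error_bound(h, n, M=1.0):
    """The bound M (2n)! h^(2n) / (2^(4n) (n!)^2) derived in the text."""
    return M * math.factorial(2 * n) * h ** (2 * n) / (2 ** (4 * n) * math.factorial(n) ** 2)


for n in range(1, 7):
    print(n, max_error_on_0_h(math.sin, 1.0, n), error_bound(1.0, n))
```

For $h = \pi$ the same routine shows the failure mentioned in the text: the tabulated values of $\sin x$ vanish up to rounding, so the sampled error on $[0, \pi]$ remains close to 1 however large $n$ is taken.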

The sequence of interpolating polynomials constructed above looks
somewhat unnatural in view of the fact that we use interpolating points
farther and farther removed from the interval where we wish to approximate
the function $f$. The following question, however, is very natural:
Let $f$ be continuous on the interval $[0, 1]$ and denote by $P_n(x)$ the
polynomial interpolating $f$ at the points

$$x_k = \frac{k}{n}, \qquad k = 0, 1, \ldots, n.$$


Is it true that

$$\lim_{n \to \infty} P_n(x) = f(x) \tag{9-7}$$

for all $x \in [0, 1]$? An important result due to Runge states that there are
continuous functions for which (9-7) does not hold. (A simple example is
$f(x) = |x - \tfrac{1}{2}|$.) Actually, the relation (9-7) even fails to hold for some
functions which have derivatives of all orders.

It is important, however, to understand Runge’s result correctly. The 
result does not mean that a continuous function cannot always be approxi- 
mated by polynomials. In fact, a famous theorem due to Weierstrass 
(see Buck [1956], p. 39) states that every $f$ continuous on a closed finite
interval $I$ can be approximated by polynomials to any desired accuracy.
Runge’s result merely states that these approximating polynomials can in 
general not be obtained by interpolation at uniformly spaced points. 
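
Runge's result can be observed numerically for the example $f(x) = |x - \tfrac{1}{2}|$ mentioned above. The following sketch is an added illustration; the sample grid, the degrees tried, and the names `interpolate` and `max_error` are assumptions made here. It measures the largest sampled deviation of $P_n$ from $f$ on $[0, 1]$ for uniformly spaced points.

```python
def interpolate(nodes, values, x):
    """Value at x of the interpolating polynomial in the Lagrangian form (9-3)."""
    total = 0.0
    for k, xk in enumerate(nodes):
        term = values[k]
        for m, xm in enumerate(nodes):
            if m != k:
                term *= (x - xm) / (xk - xm)
        total += term
    return total


def max_error(n, samples=2000):
    """Largest sampled |f(x) - P_n(x)| on [0, 1] for f(x) = |x - 1/2| and the
    interpolating points x_k = k/n, k = 0, 1, ..., n."""
    nodes = [k / n for k in range(n + 1)]
    values = [abs(xk - 0.5) for xk in nodes]
    worst = 0.0
    for i in range(samples + 1):
        x = i / samples
        worst = max(worst, abs(abs(x - 0.5) - interpolate(nodes, values, x)))
    return worst


for n in (4, 10, 20, 30):
    print(n, max_error(n))
```

The deviation is modest for small $n$ but grows rapidly as $n$ increases, the largest values occurring near the ends of the interval.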


Problems 


14. Missing entry in a table. A function $f$ is defined on the whole real line
and satisfies

$$|f^{(m)}(x)| \le M^m, \qquad -\infty < x < \infty; \quad m = 0, 1, 2, \ldots$$

for some constant $M$. For $n = 1, 2, \ldots$ let $P_{2n-1}$ denote the polynomial
interpolating $f$ at the points $-n, -n + 1, \ldots, -1, 1, \ldots, n - 1, n$. Prove
that

$$\lim_{n \to \infty} P_{2n-1}(0) = f(0)$$

holds provided that $M < 2$.
15. A function $f$ is defined for $x \ge 0$ and satisfies


$$|f^{(m)}(x)| \le 1, \qquad x \ge 0, \quad m = 0, 1, 2, \ldots.$$


For a fixed value of $h$, let $P_n$ denote the polynomial interpolating $f$ at the
points $0, h, 2h, \ldots, nh$. For what values of $h$ can you guarantee that

$$\lim_{n \to \infty} P_n(x) = f(x)$$

for every fixed value of $x > 0$?

16*. Let the function $f$ be continuous on the interval $[0, 1]$. From the fact
that such a function is uniformly continuous (see Buck [1956], p. 34) one
can easily prove that $f$ can be approximated to arbitrary accuracy by the
piecewise linear function coinciding with $f$ at suitable points $x_0, x_1, \ldots,
x_n$. Thus, in a sense, $f$ can be approximated arbitrarily well by linear
interpolation. Why does this not contradict Runge’s theorem?

17. A function $f$ is defined on $[0, 1]$, and its derivatives satisfy

$$|f^{(m)}(x)| \le m!, \qquad m = 0, 1, 2, \ldots, \quad 0 \le x \le 1.$$

(Example: $f(x) = (1 + x)^{-1}$.) Let $P_n$ denote the polynomial
interpolating $f$ at the points $1, q, q^2, \ldots, q^n$, where $q$ is some number such that
$0 < q < 1$. Show that

$$\lim_{n \to \infty} P_n(0) = f(0).$$


9.4 How to Approximate a Polynomial of Degree n by One of Degree n - 1

Let $Q$ be a polynomial of degree $n$ with leading coefficient 1,

$$Q(x) = x^n + a_{n-1} x^{n-1} + \cdots + a_0.$$


We wish to interpolate $Q$ in the interval $[-1, 1]$ by a polynomial $P$ of
degree $n - 1$ such that the maximum of the error $|Q(x) - P(x)|$ is
minimized. How do we have to choose the interpolating points $x_0, x_1,
x_2, \ldots, x_{n-1}$, and how large is the smallest possible maximum error?
From the general error formula (9-5) we find, since $Q^{(n)}(x) = n!$,

$$Q(x) - P(x) = L(x), \tag{9-8}$$

where

$$L(x) = (x - x_0)(x - x_1) \cdots (x - x_{n-1}) = x^n + \cdots.$$

Our problem is thus equivalent to the problem of selecting the points
$x_0, x_1, \ldots, x_{n-1}$ in such a manner that the quantity

$$\max_{-1 \le x \le 1} |(x - x_0)(x - x_1) \cdots (x - x_{n-1})|$$

is minimized. Although seemingly difficult, this problem can be solved
explicitly.


Theorem 9.4 The best choice of the interpolating points $x_0, x_1, \ldots,
x_{n-1}$ for the approximation of the polynomial $Q(x) = x^n + \cdots$ in the
interval $-1 \le x \le 1$ by a polynomial of degree $n - 1$ is the choice
for which

$$L(x) = \frac{1}{2^{n-1}} T_n(x), \tag{9-9}$$

where $T_n$ denotes the $n$th Chebyshev polynomial,†

$$T_n(x) = \cos(n \arccos x).$$


Proof. Let us first convince ourselves that the function $L$ defined by
(9-9) really is a polynomial of degree $n$ with leading coefficient 1. From
the difference equation satisfied by the Chebyshev polynomials,

$$T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x)$$

and from the fact that $T_0(x) = 1$, $T_1(x) = x$ it readily follows that $T_n$ is a
polynomial in $x$ with leading coefficient $2^{n-1}$. Hence our assertion on $L$
follows immediately.

Since we are interested in minimizing the maximum of $|L(x)|$, let us
calculate the extrema of $L(x)$. The extrema of $\cos x$ occur for $x = k\pi$,
where $k$ is an integer, hence the extrema of $L(x)$ in the interval $[-1, 1]$
occur at the points where $n \arccos x = k\pi$, i.e., for $x = t_k$, where

$$t_k = \cos \frac{k\pi}{n}, \qquad k = 0, 1, \ldots, n.$$

The values of $L$ at these points are

$$L(t_k) = (-1)^k 2^{-n+1},$$

i.e., the extrema all have the same absolute value $2^{-n+1}$, but oscillate in
sign (see Fig. 9.4).


† See example 3, chapter 6.

Figure 9.4 The function $y = L(x)$ ($n = 9$).


Now suppose there exists another polynomial $M(x) = x^n + \cdots$ for
which $|M(x)|$ has a smaller maximum $m < 2^{-n+1}$ in $[-1, 1]$. Then the
difference polynomial

$$D(x) = L(x) - M(x)$$

would at the $n + 1$ points $t_k$ have the same sign as $L$, i.e.,

$$D(t_k) \begin{cases} > 0, & k \text{ even}, \\ < 0, & k \text{ odd}. \end{cases}$$

Since $t_0 > t_1 > t_2 > \cdots > t_n$, it would follow that $D$ has at least $n$ distinct
zeros, namely one in each interval $[t_{k+1}, t_k]$. However, since both $L$ and
$M$ have leading coefficient 1, it follows that $D$ is a polynomial of degree
$\le n - 1$, hence $D$ cannot have $n$ distinct zeros without vanishing
identically. The assumption of the existence of a polynomial $M$ with a
maximum deviation from 0 smaller than $2^{-n+1}$ thus has led to
a contradiction.

The interpolating points $x_k$ for the best approximation of $Q$ by a
polynomial of lower degree are the zeros of $L$, that is the points $x = x_k$
satisfying

$$n \arccos x = (k + \tfrac{1}{2})\pi, \qquad k = 0, 1, \ldots, n - 1.$$

It follows that the interpolating points are given by

$$x_k = \cos\left(\frac{2k + 1}{2n} \pi\right), \qquad k = 0, 1, 2, \ldots, n - 1.$$

EXAMPLE
6. How can we best approximate the function

$$f(x) = x^2 + ax + b$$

in $-1 \le x \le 1$ by a straight line? This is the case $n = 2$ of theorem 9.4.
We have

$$x_0 = \cos \frac{\pi}{4} = \frac{\sqrt{2}}{2}, \qquad x_1 = \cos \frac{3\pi}{4} = -\frac{\sqrt{2}}{2}.$$

The interpolating polynomial is given by

$$\begin{aligned} P(x) &= \frac{f(x_0)(x - x_1) + f(x_1)(x_0 - x)}{x_0 - x_1} \\ &= \frac{(x_0^2 + a x_0 + b)(x - x_1) + (x_1^2 + a x_1 + b)(x_0 - x)}{x_0 - x_1} \\ &= ax + b + \tfrac{1}{2}. \end{aligned}$$

The maximum deviation from $x^2 + ax + b$ in $[-1, 1]$ is $\tfrac{1}{2}$, as predicted
by the theory.
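
Both the theorem and example 6 can be checked by a short computation. The sketch below is an added illustration; the value $n = 9$, the coefficients `a = 3.0`, `b = -1.0`, the sample grid, and the names `chebyshev_points` and `max_abs_product` are assumptions made here. It compares the maximum of $|L(x)|$ for the points of theorem 9.4 with that for equidistant points, and then repeats example 6 numerically.

```python
import math


def chebyshev_points(n):
    """The interpolating points x_k = cos((2k + 1) pi / (2n)), k = 0, ..., n - 1."""
    return [math.cos((2 * k + 1) * math.pi / (2 * n)) for k in range(n)]


def max_abs_product(points, samples=2000):
    """Largest sampled value of |(x - x_0)...(x - x_{n-1})| on [-1, 1]."""
    worst = 0.0
    for i in range(samples + 1):
        x = -1.0 + 2.0 * i / samples
        value = 1.0
        for xk in points:
            value *= x - xk
        worst = max(worst, abs(value))
    return worst


n = 9
chebyshev_max = max_abs_product(chebyshev_points(n))
equidistant_max = max_abs_product([-1.0 + 2.0 * k / (n - 1) for k in range(n)])
print(chebyshev_max, 2.0 ** (1 - n), equidistant_max)

# Example 6 with assumed coefficients a and b.
a, b = 3.0, -1.0
x0, x1 = chebyshev_points(2)


def f(x):
    return x * x + a * x + b


def line(x):
    """Straight line interpolating f at x0 and x1."""
    return (f(x0) * (x - x1) + f(x1) * (x0 - x)) / (x0 - x1)


deviation = max(abs(f(-1.0 + i / 500) - line(-1.0 + i / 500)) for i in range(1001))
print(deviation)
```

For the Chebyshev points the sampled maximum agrees with $2^{-n+1}$, while the equidistant points give a noticeably larger value.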


It is not necessary to use the interpolation points $x_k$ in order to construct
the polynomial $P$ of best approximation. From the error formula (9-8)
we find, if $L$ is given by (9-9),

$$P(x) = Q(x) - \frac{1}{2^{n-1}} T_n(x). \tag{9-10}$$


EXAMPLE


7. We consider once more the problem of example 6. From the
recurrence relation we easily find $T_2(x) = 2x^2 - 1$, hence

$$P(x) = x^2 + ax + b - \tfrac{1}{2}(2x^2 - 1) = ax + b + \tfrac{1}{2},$$


in accordance with the earlier result. 
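
Formula (9-10) translates directly into a procedure for evaluating $P$. The following sketch is an added illustration; the names `chebyshev_t` and `best_lower_degree_value` and the sample coefficients are assumptions introduced here. It evaluates $T_n$ by the recurrence relation and subtracts $2^{-n+1} T_n(x)$ from $Q(x)$.

```python
def chebyshev_t(n, x):
    """T_n(x) computed from T_0 = 1, T_1 = x and T_n = 2x T_{n-1} - T_{n-2}."""
    if n == 0:
        return 1.0
    previous, current = 1.0, x
    for _ in range(n - 1):
        previous, current = current, 2.0 * x * current - previous
    return current


def best_lower_degree_value(q, n, x):
    """Value at x of P from (9-10); q must be a polynomial x^n + ... of degree n
    with leading coefficient 1."""
    return q(x) - chebyshev_t(n, x) / 2.0 ** (n - 1)


a, b = 3.0, -1.0   # assumed coefficients for the problem of example 7
grid = [-1.0 + i / 50 for i in range(101)]
quadratic_check = max(abs(best_lower_degree_value(lambda x: x * x + a * x + b, 2, x)
                          - (a * x + b + 0.5)) for x in grid)
# For Q(x) = x^3 formula (9-10) gives P(x) = x^3 - (4x^3 - 3x)/4 = 3x/4.
cubic_check = max(abs(best_lower_degree_value(lambda x: x ** 3, 3, x) - 0.75 * x)
                  for x in grid)
print(quadratic_check, cubic_check)
```

Both differences vanish up to rounding, so the procedure reproduces $ax + b + \tfrac{1}{2}$ for the quadratic and $\tfrac{3}{4}x$ for $Q(x) = x^3$.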


Relation (9-8) shows that the error curve for the best approximation of
a polynomial $Q(x) = x^n + \cdots$ by a polynomial of lower degree is given
by $2^{-n+1} T_n(x)$. The discussion in the proof of theorem 9.4 revealed that
this curve has $n + 1$ extrema in $[-1, 1]$ with alternating signs, but all of
the same absolute value. This property is shared by the polynomial $P$
minimizing

$$\max_{-1 \le x \le 1} |f(x) - P(x)|$$

where $f$ is any continuous function. The theory of such minimizing
polynomials was initiated by Chebyshev (1821-1894); it plays an
outstanding role in modern numerical computation.


Problems
18. Determine the polynomial of degree $\le n - 1$ that best approximates

$$Q(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_n$$

on an arbitrary interval $a \le x \le b$, and show that the least value of the
maximum deviation is given by

$$\frac{|a_0|}{2^{n-1}} \left(\frac{b - a}{2}\right)^n.$$

[Hint: Reduce the problem to the special case considered above by
introducing a new variable $x^*$ by setting

$$x = \frac{b + a}{2} + \frac{b - a}{2} x^*.]$$

19. Determine a polynomial of degree $\le 4$ that provides the best approximation
to the function $f(x) = x^5$ on the interval $[0, 4]$.
20. Approximate $f(x) = x^3$ on the interval $[0, 1]$ by a polynomial of degree 2.
21. Approximate $f(x) = x^3$ on the interval $[0, 1]$ by a polynomial of degree 1
by approximating the approximating polynomial of problem 20 by a
linear polynomial.
22. Determine directly (by calculus) a linear polynomial $P(x) = ax + b$
such that the quantity

$$\max_{0 \le x \le 1} |x^3 - ax - b|$$

is minimized.
23. Prove the uniqueness of the solution (9-10) of the approximation problem
considered at the beginning of §9.4.


Recommended Reading 


A more general treatment of the error of Lagrangian interpolation is 
given in Ostrowski [1960], chapter 1. For a discussion of the convergence
of sequences of interpolating polynomials see Hildebrand [1956], pp. 114-118.
A first introduction to the theorem of Weierstrass and to the
approximation of continuous functions by polynomials in general is 
given in Todd [1963]. 


Research Problems 

1. Assuming that f is sufficiently differentiable, how well does the 
derivative of an interpolating polynomial approximate the derivative of 
f? (For a partial answer, see §12.1.) 

2. By extending the procedure outlined in problem 21, how well can you 
at least approximate a polynomial of degree n by one of arbitrary degree 
m<n? 


chapter 10 construction of the interpolating polynomial: methods using ordinates


After considering the more theoretical aspects of the interpolating 
polynomial in chapter 9, we shall now discuss some algorithms for actually 
constructing the polynomial. Many such algorithms have been devised, 
frequently with some special purpose in mind. There are two main 
categories of such algorithms: In the first category, the function f enters 
through its values (or “ordinates’’) at all interpolation points. In the 
second category, f enters through its value at one point, and through 
differences of the function values. Here we are concerned with algorithms 
of the first category. 


10.1 Muller’s Method†


For some purposes the interpolating polynomial is calculated most 
conveniently from the Lagrangian formula (9-4). The Lagrangian 
formula is especially convenient if the polynomial is to be subjected to 
algebraic manipulations. As an example of an application of the 
Lagrangian representation, we shall discuss in more detail Muller’s 
method for solving the equation f(x) = 0 mentioned in §4.11. 

The reader will recall that the essence of Muller’s method is as follows.
Assuming that three distinct approximations $x_{n-2}, x_{n-1}, x_n$ to the desired
solution $s$ are available, we gain a new approximation by interpolating
the function $f$ at the points $x_{n-2}, x_{n-1}, x_n$ by a (normally) quadratic
polynomial $P$. Of the (normally) two zeros of $P$, the one closest to $x_n$ is
selected as the new approximation $x_{n+1}$. The process then is continued
with $(x_{n-1}, x_n, x_{n+1})$ in place of $(x_{n-2}, x_{n-1}, x_n)$ and terminated as soon as
$|x_{n+1} - x_n| / |x_{n+1}|$ becomes less than some preassigned number.


† This section may be omitted at first reading.

The Lagrangian representation of $P$ is

$$\begin{aligned} P(x) = {} & \frac{(x - x_{n-1})(x - x_{n-2})}{(x_n - x_{n-1})(x_n - x_{n-2})} f_n + \frac{(x - x_n)(x - x_{n-2})}{(x_{n-1} - x_n)(x_{n-1} - x_{n-2})} f_{n-1} \\ & + \frac{(x - x_n)(x - x_{n-1})}{(x_{n-2} - x_n)(x_{n-2} - x_{n-1})} f_{n-2}. \end{aligned}$$

In order to write this in a more compact manner, we introduce the
quantities

$$h_n = x_n - x_{n-1}, \qquad h = x - x_n, \tag{10-1}$$

and obtain

$$\begin{aligned} P(x) = P(x_n + h) = {} & \frac{(h + h_n)(h + h_n + h_{n-1})}{h_n (h_n + h_{n-1})} f_n - \frac{h (h + h_n + h_{n-1})}{h_n h_{n-1}} f_{n-1} \\ & + \frac{h (h + h_n)}{(h_n + h_{n-1}) h_{n-1}} f_{n-2}. \end{aligned}$$

Collecting terms involving like powers of $h$ and writing

$$q_n = \frac{h_n}{h_{n-1}}, \qquad q = \frac{h}{h_n}, \tag{10-2}$$

we find

$$P(x) = P(x_n + q h_n) = (1 + q_n)^{-1} (A_n q^2 + B_n q + C_n)$$

where

$$\begin{aligned} A_n &= q_n f_n - q_n (1 + q_n) f_{n-1} + q_n^2 f_{n-2}, \\ B_n &= (2q_n + 1) f_n - (1 + q_n)^2 f_{n-1} + q_n^2 f_{n-2}, \\ C_n &= (1 + q_n) f_n. \end{aligned} \tag{10-3}$$

Solving the quadratic equation $P(x_n + q h_n) = 0$, we find

$$x_{n+1} = x_n + h_n q_{n+1},$$

where

$$q_{n+1} = \frac{-B_n \pm \sqrt{B_n^2 - 4 A_n C_n}}{2 A_n}.$$

In order to avoid loss of accuracy due to forming differences, this formula
is better written in the form

$$q_{n+1} = \frac{-2 C_n}{B_n \pm \sqrt{B_n^2 - 4 A_n C_n}}. \tag{10-4}$$

Here the sign yielding the smaller value of $q_{n+1}$, i.e., the larger absolute
value of the denominator, should be chosen.

It may happen, of course, that the square root in (10-4) becomes
imaginary. If $f$ is defined for real values only, the algorithm then breaks
down, and a new start must be made. If $f$ is a polynomial, the possibility
of imaginary square roots is considered an advantage, since this will
automatically lead to approximations to complex zeros.

Three starting values $x_0, x_1, x_2$ for the algorithm have to be provided
from some other source. Muller recommends starting the algorithm by
taking for $P$ the Taylor polynomial of degree 2 of $f$ at $x = 0$. If $f$ is a
polynomial,

$$f(x) = a_0 x^N + a_1 x^{N-1} + \cdots + a_N,$$

this can be achieved artificially by putting $x_0 = -1$, $x_1 = 1$, $x_2 = 0$, thus

$$h_1 = 2, \qquad h_2 = -1, \qquad q_2 = -\tfrac{1}{2} \tag{10-5}$$

and setting

$$\begin{aligned} f_0 &= a_N - a_{N-1} + a_{N-2}, \\ f_1 &= a_N + a_{N-1} + a_{N-2}, \\ f_2 &= a_N. \end{aligned} \tag{10-6}$$

As soon as a zero of $f$ has been determined, it is to be divided out by
algorithm 3.4 in connection with theorem 4.10.

It follows from the work of Ostrowski ([1960], p. 86, although without 
reference to Muller’s work) that Muller’s method converges whenever the 
three initial approximations are sufficiently close to a simple zero of f. 
The degree of convergence lies somewhere between that of the regula falsi 
and of Newton’s method. No convergence theorems in the large similar 
to those for the QD algorithm appear to be known. Nevertheless, the 
method is (in the United States) among the most popular for finding zeros 
of polynomials. 
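
The iteration defined by (10-1) to (10-4) can be sketched in a few lines. The code below is an added illustration, not Muller's own program: the function name `muller`, the tolerance, the iteration limit, and the test functions are assumptions made here, and the artificial starting procedure (10-5), (10-6) and the deflation step are omitted. Complex arithmetic is used throughout, so that an imaginary square root in (10-4) does not stop the process.

```python
import cmath


def muller(f, x0, x1, x2, tol=1e-12, max_iter=50):
    """Muller's iteration in the notation of (10-1) to (10-4)."""
    xa, xb, xc = complex(x0), complex(x1), complex(x2)   # x_{n-2}, x_{n-1}, x_n
    fa, fb, fc = f(xa), f(xb), f(xc)
    for _ in range(max_iter):
        if fc == 0:
            return xc
        h_prev = xb - xa                    # h_{n-1}
        h_n = xc - xb                       # h_n
        q_n = h_n / h_prev
        A = q_n * fc - q_n * (1 + q_n) * fb + q_n ** 2 * fa
        B = (2 * q_n + 1) * fc - (1 + q_n) ** 2 * fb + q_n ** 2 * fa
        C = (1 + q_n) * fc
        root = cmath.sqrt(B * B - 4 * A * C)
        # Choose the sign that gives the denominator of larger absolute value.
        denominator = B + root if abs(B + root) >= abs(B - root) else B - root
        if denominator == 0:
            raise ZeroDivisionError("denominator of (10-4) vanishes; restart needed")
        q_next = -2 * C / denominator       # formula (10-4)
        x_new = xc + h_n * q_next
        if abs(x_new - xc) <= tol * abs(x_new):
            return x_new
        xa, xb, xc = xb, xc, x_new
        fa, fb, fc = fb, fc, f(x_new)
    raise RuntimeError("no convergence within max_iter steps")


real_zero = muller(lambda x: x ** 3 - x - 1, 1.0, 1.5, 2.0)
complex_zero = muller(lambda z: z * z + 1, -1.0, 1.0, 0.0)
print(real_zero, complex_zero)
```

In the second call all three starting values are real, yet the square root in (10-4) is imaginary at the first step and the iteration arrives at one of the zeros $\pm i$ of $z^2 + 1$.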


Problems
1. Use Muller’s method to find all zeros of the polynomial

$$P(x) = 128x^4 - 256x^3 + 160x^2 - 32x + 1.$$

(Real arithmetic may be used here.)
2. Use complex arithmetic to determine all zeros of the polynomial

$$P(x) = x^4 - 8x^3 + 39x^2 - 62x + 51$$

by Muller’s method.


10.2. The Lagrangian Representation for Equidistant Abscissas†


In the present section we assume that the points $x_k$, where the values of
the function $f$ are given, are equally spaced. This is the case, for instance,
for most mathematical tables. If $h$ denotes the distance between two
consecutive interpolating points, we then have

$$x_k = x_0 + kh, \tag{10-7}$$

where $k = 0, \pm 1, \pm 2, \ldots$. We now introduce a new variable $s$ by
means of the relation

$$x = x_0 + sh. \tag{10-8}$$

At $x = x_k$, $s$ obviously has the value $k$. The variable $s$ thus measures $x$
in units of $h$, starting at $x_0$.

We now consider the polynomial $P$ of degree $n - m$ which interpolates
the function $f$ at the points $x_m, x_{m+1}, \ldots, x_n$. Here $m$ and $n$ may be any
two integers such that $n \ge m$. (Ordinarily, we have $m \le 0$, $n \ge 0$.) By
(9-4), this polynomial is given by

$$P(x) = \sum_{k=m}^{n} L_k(x) f_k,$$

where

$$L_k(x) = \prod_{\substack{q=m \\ q \ne k}}^{n} \frac{x - x_q}{x_k - x_q}.$$

We now express $P$ in terms of the variable $s$ defined by (10-8). Evidently

$$x - x_q = (s - q)h,$$

and in particular,

$$x_k - x_q = (k - q)h.$$

If $P(x) = P(x_0 + sh) = p(s)$, we thus have

$$p(s) = \sum_{k=m}^{n} l_k(s) f_k, \tag{10-9}$$

where

$$l_k(s) = \prod_{\substack{q=m \\ q \ne k}}^{n} \frac{s - q}{k - q}.$$

The remarkable fact about this representation of the Lagrangian
polynomial is the independence of the functions $l_k(s)$ from $h$. These
functions, which may be called the normalized Lagrangian interpolation
coefficients, depend only on $s$ (the relative location of $x$ with respect to $x_0$
and $x_1$), and, of course, on the integers $m$ and $n$, which define the set of
interpolating points.


† This section may be omitted at first reading.


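EXAMPLES
1. $m = 0$, $n = 1$. We have

$$l_0(s) = \frac{s - 1}{0 - 1} = 1 - s, \qquad l_1(s) = \frac{s - 0}{1 - 0} = s.$$

We get, of course, the formula for linear interpolation, expressed as

$$p(s) = (1 - s) f_0 + s f_1.$$


2. A case which is frequently used in practice is given by $m = -1$,
$n = 2$. Here we find

$$\begin{aligned} l_{-1}(s) &= \frac{s(s - 1)(s - 2)}{(-1)(-2)(-3)} = -\frac{s(s - 1)(s - 2)}{6}, \\ l_0(s) &= \frac{(s + 1)(s - 1)(s - 2)}{1 \cdot (-1)(-2)} = \frac{(s + 1)(s - 1)(s - 2)}{2}, \\ l_1(s) &= \frac{(s + 1)s(s - 2)}{2 \cdot 1 \cdot (-1)} = l_0(1 - s), \\ l_2(s) &= \frac{(s + 1)s(s - 1)}{3 \cdot 2 \cdot 1} = l_{-1}(1 - s). \end{aligned}$$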

In view of the fact that the normalized Lagrangian interpolation
coefficients depend on one continuous variable only, extensive tables for
them have been prepared (National Bureau of Standards [1948]). Such
tables take into account symmetry properties such as the relation
$l_0(s) = l_1(1 - s)$ noted above.

We note some interesting algebraic relations between the normalized
interpolation coefficients. Let us consider the general case, where the
interpolating points are $x_m, x_{m+1}, \ldots, x_n$. The polynomial

$$P(x) = \sum_{k=m}^{n} L_k(x) f_k,$$

using $n - m + 1$ points, will furnish an exact representation of the
function $f$ if $f$ is a polynomial of degree $n - m$ or less. Thus it will be
exact, in particular, for the functions

$$f(x) = \left(\frac{x - x_0}{h}\right)^q, \qquad q = 0, 1, \ldots, n - m.$$

Since $f(x) = s^q$, we have $f_k = k^q$. Hence (10-9) yields the identities in $s$,

$$\sum_{k=m}^{n} l_k(s) k^q = s^q, \qquad q = 0, 1, \ldots, n - m. \tag{10-10}$$

These identities can be regarded as a system of $n - m + 1$ equations for
$n - m + 1$ unknowns $l_k(s)$. They thus could be used to calculate the $l_k$
numerically.

EXAMPLE

3. For $m = 0$, $n = 2$ the relations (10-10) take the form

$$\begin{aligned} l_0(s) + l_1(s) + l_2(s) &= 1, \\ l_1(s) \cdot 1 + l_2(s) \cdot 2 &= s, \\ l_1(s) \cdot 1^2 + l_2(s) \cdot 2^2 &= s^2. \end{aligned}$$

If $f$ is not a polynomial of degree $\le n - m$, the error formula (9-5) still
stands. In the present situation,

$$L(x) = \prod_{k=m}^{n} (x - x_k)$$

and thus

$$l(s) = L(x_0 + sh) = h^{n-m+1} \prod_{k=m}^{n} (s - k).$$

Thus the error formula now appears in the form

$$f(x_0 + sh) - p(s) = \frac{h^{n-m+1}}{(n - m + 1)!} f^{(n-m+1)}(\xi_x) \prod_{k=m}^{n} (s - k), \tag{10-11}$$

where $\xi_x$ is a point between the largest and the smallest of the numbers
$x_m$, $x_n$, $x$.


EXAMPLE
4. If linear interpolation is used as in example 1, we have for $0 \le s \le 1$

$$p(s) - f(x_0 + sh) = \frac{h^2}{2} f''(\xi)\, s(1 - s),$$

where $x_0 \le \xi \le x_1$.


Problems


3. Use normalized Lagrangian interpolation coefficients to determine
$J_0(2.4068)$ by interpolation from the following values:

    $x$       $J_0(x)$
    2.1      0.16661
    2.3      0.05554
    2.5     -0.04838
    2.7     -0.14245

4. If the interpolating points are $x_m, x_{m+1}, \ldots, x_n$, prove that

$$l_k(s) = l_{n+m-k}(n + m - s), \qquad k = m, m + 1, \ldots, n.$$

5. Make a general statement about the signs of the $l_k(s)$ as a function of
$m$, $n$, $k$, and $s$.
6. If the interpolating points are $x_0, x_1, \ldots, x_n$, show that

$$l_k(s) = (-1)^{n-k} \binom{s}{k} \binom{s - k - 1}{n - k}, \qquad k = 0, 1, \ldots, n.$$

7. Assuming that the interpolating points are $x_m, x_{m+1}, \ldots, x_n$, find a closed
expression for the sum

$$\sum_{k=m}^{n} k^{n-m+1} l_k(s).$$

[Hint: Apply the error formula (10-11) to the function
$f(x) = (x - x_0)^{n-m+1}$.]


10.3. Aitken’s Lemma 


We now shall discuss certain algorithms that permit us to construct the 
interpolating polynomial recursively, without reference to the Lagrangian 
formula (9-4). The basic tool is a lemma which enables us to represent an 
interpolating polynomial of degree d + 1 in terms of two such polynomials 
of degree d. 

Some special notation will be required. We again denote the points
at which the function $f$ is to be interpolated by $x_0, x_1, x_2, \ldots, x_n$, and by
$f_k$ the value of $f$ at $x_k$. We shall have to consider polynomials that
interpolate $f$ at some, but not all of the points $x_0, x_1, \ldots, x_n$. If $S$ is any
nonempty subset of $\{x_0, x_1, \ldots, x_n\}$, we denote by $P_S$ the polynomial
interpolating $f$ at those $x_k$ which are in $S$. Thus, if $S$ contains $k + 1$
points, $P_S$ is the unique polynomial of degree $\le k$ such that

$$P_S(x_i) = f_i, \qquad x_i \in S.$$
EXAMPLES

5. If S contains just one point $x_i$, then $P_S = f_i$.
6. $P_{\{x_1, x_3, x_5\}}$ denotes the polynomial interpolating at the points $x_1$, $x_3$, $x_5$.
Denoting by W the set of all interpolating points, we can state the
following lemma:

Lemma 10.3 Let S and T be two proper subsets of W having all
but the two points $x_i \in S$ and $x_j \in T$ in common. Then

$$P_{S \cup T}(x) = \frac{(x_i - x) P_T(x) - (x_j - x) P_S(x)}{x_i - x_j} \tag{10-12}$$

identically in x.


Here, as usual, $S \cup T$ denotes the union of the sets S and T.

Proof. Let the sets S and T contain m + 1 points each. Both
polynomials $P_S$ and $P_T$ interpolate at m + 1 points, hence are of degree $\le m$.
Denoting the expression on the right of (10-12) by P, we see that P has a
degree $\le m + 1$. Hence if we can show that P interpolates at all points
of $S \cup T$, then theorem 9.1 implies that $P = P_{S \cup T}$.

Let $x_k$ be a point of the intersection $S \cap T$ of S and T. By virtue of

$$P_S(x_k) = P_T(x_k) = f_k,$$

(10-12) yields

$$P(x_k) = \frac{(x_i - x_k) f_k - (x_j - x_k) f_k}{x_i - x_j} = f_k,$$

as desired. For $x = x_i$ we have

$$P(x_i) = -\frac{(x_j - x_i) P_S(x_i)}{x_i - x_j} = f_i,$$

and similarly for $x = x_j$

$$P(x_j) = \frac{(x_i - x_j) P_T(x_j)}{x_i - x_j} = f_j.$$

Thus P has been shown to interpolate at all points in $S \cup T$, completing
the proof.


EXAMPLES

7. If $S = \{x_0, x_1\}$, $T = \{x_0, x_2\}$, we obtain

$$P_{\{x_0, x_1, x_2\}}(x) = \frac{(x_1 - x) P_{\{x_0, x_2\}}(x) - (x_2 - x) P_{\{x_0, x_1\}}(x)}{x_1 - x_2}.$$

8. Lemma 10.3 is already familiar if the intersection $S \cap T$ is empty.
We then have, using example 5,

$$P_{\{x_i, x_j\}}(x) = \frac{(x_i - x) P_{\{x_j\}}(x) - (x_j - x) P_{\{x_i\}}(x)}{x_i - x_j} = \frac{(x_i - x) f_j - (x_j - x) f_i}{x_i - x_j}.$$

This is the familiar formula for linear interpolation.


Lemma 10.3 can be used in two ways. We may use it to get a formal 
representation for the interpolating polynomial, or we may use it to 
calculate the value of the polynomial for a given value of x. In the latter 
case formula (10-12) requires dividing a sum of products by a single 
number, an operation that can be performed on a desk computer without 
writing down intermediate results. 
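
The second, numerical use of lemma 10.3 can be sketched in a few lines of Python. This is only an illustration: the helper name `combine` and the sample data ($f(x) = x^2$ at the points 0, 1, 3, evaluated at x = 2) are assumptions made here, not taken from the text.

```python
def combine(x, xi, p_s, xj, p_t):
    """Value at x of the interpolant on S u T, following (10-12).

    xi is the point that belongs only to S, xj the point that belongs
    only to T; p_s and p_t are the numbers P_S(x) and P_T(x).
    """
    return ((xi - x) * p_t - (xj - x) * p_s) / (xi - xj)

# Illustrative data: f(x) = x**2 at the points 0, 1, 3, evaluated at x = 2.
x = 2.0

# Linear interpolation (empty intersection, as in example 8).
p01 = combine(x, 0.0, 0.0, 1.0, 1.0)   # on {0, 1}
p03 = combine(x, 0.0, 0.0, 3.0, 9.0)   # on {0, 3}

# Quadratic interpolation on {0, 1, 3}: S = {0, 1}, T = {0, 3}.
p013 = combine(x, 1.0, p01, 3.0, p03)
print(p01, p03, p013)
```

Since f is itself a quadratic, the last value reproduces f(2) = 4 exactly, as theorem 9.1 requires.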
Problems


8. Obtain the Lagrangian formula for quadratic interpolation on the set
$\{x_0, x_1, x_2\}$ from the formulas for linear interpolation on the sets $\{x_0, x_1\}$
and $\{x_0, x_2\}$.

9. Prove the following generalization of lemma 10.3: Let S be an arbitrary
subset of $\{x_{m+1}, x_{m+2}, \ldots, x_n\}$ and let $S_k = \{x_k, S\}$, k = 1, 2, ..., m. If
$L_k(x)$ (k = 1, 2, ..., m) are the Lagrangian interpolating coefficients for
interpolation on the set $\{x_1, x_2, \ldots, x_m\}$, then

$$P_{\cup S_k}(x) = \sum_{k=1}^{m} L_k(x) P_{S_k}(x).$$


10.4 Aitken’s Algorithm 


Lemma 10.3 enables us to generate the interpolating polynomials of
higher degrees successively from polynomials of lower degrees. It still
leaves us considerable freedom in the choice of the sets S and T used to
finally obtain the polynomial $P_W$. Two standardized choices have become
widely used, one named after Aitken, the other named after Neville. In
both choices a triangular array of polynomials $P_{k,d}$ is generated. Here
$P_{k,d}$ is a certain polynomial of degree d that interpolates on a set of d + 1
points depending on k. Aitken’s scheme is as follows:


Algorithm 10.4 For d = 0, 1, ..., n, generate the polynomials $P_{k,d}$
as follows:

$$P_{k,0}(x) = f_k, \qquad k = 0, 1, \ldots, n; \tag{10-13}$$

$$P_{k,d+1}(x) = \frac{(x_k - x) P_{d,d}(x) - (x_d - x) P_{k,d}(x)}{x_k - x_d}, \qquad k = d + 1, d + 2, \ldots, n. \tag{10-14}$$

The arrangement of the polynomials $P_{k,d}$ is shown in scheme 10.4.

     d      0         1         2        ...     n
    x_0   P_{0,0}                                           x_0 - x
    x_1   P_{1,0}   P_{1,1}                                 x_1 - x
    x_2   P_{2,0}   P_{2,1}   P_{2,2}                       x_2 - x
    ...
    x_k   P_{k,0}   P_{k,1}   P_{k,2}    ...                x_k - x
    ...
    x_n   P_{n,0}   P_{n,1}   P_{n,2}    ...   P_{n,n}      x_n - x

Scheme 10.4

By (10-14), each entry $P_{k,d+1}$ in scheme 10.4 is obtained by crosswise
multiplication of the entries $P_{d,d}$, $x_d - x$ in row d and $P_{k,d}$, $x_k - x$
in row k, followed by division by $x_k - x_d$.


Theorem 10.4 In the notation of lemma 10.3,

$$P_{k,d} = P_{\{x_0, x_1, \ldots, x_{d-1}, x_k\}},$$

d = 0, 1, ..., n; k = d, d + 1, ..., n.


Proof. We use induction with respect to d. By (10-13), the assertion is
true for d = 0. Assuming it to be true for some $d \ge 0$, lemma 10.3
shows that the polynomial defined by (10-14) interpolates on the union of
the sets $\{x_0, x_1, \ldots, x_{d-1}, x_d\}$ and $\{x_0, x_1, \ldots, x_{d-1}, x_k\}$, that is, on the set
$\{x_0, x_1, \ldots, x_d, x_k\}$, proving our assertion for d increased by one. For
d = k = n we obtain


Corollary 10.4 $P_{n,n} = P_W$.


The rightmost entry in scheme 10.4 is the polynomial that interpolates
on the set of all points $x_0, x_1, \ldots, x_n$.


EXAMPLE


9. Let $f(x) = x^4$. We wish to calculate f(3) by interpolation at the
points -4, -2, 0, 2, 4. Scheme 10.4 looks as follows:

    x_k   P_{k,0}   P_{k,1}   P_{k,2}   P_{k,3}   P_{k,4}   x_k - x
    -4      256                                               -7
    -2       16      -584                                     -5
     0        0      -192       396                           -3
     2       16       -24       116       -24                 -1
     4      256       256       116       186        81        1
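
The table of example 9 can be reproduced mechanically. The following Python sketch follows (10-13) and (10-14) as stated in algorithm 10.4; the function name `aitken` and the list-of-rows layout are choices made here for illustration.

```python
def aitken(xs, fs, x):
    """Triangular array P[k][d] of algorithm 10.4, evaluated at x.

    xs are the interpolating points, fs the function values; row k of
    the result holds P_{k,0}, ..., P_{k,k}.
    """
    n = len(xs) - 1
    P = [[float(f)] for f in fs]                      # (10-13)
    for d in range(n):
        for k in range(d + 1, n + 1):
            num = (xs[k] - x) * P[d][d] - (xs[d] - x) * P[k][d]
            P[k].append(num / (xs[k] - xs[d]))        # (10-14)
    return P

# Data of example 9: f(x) = x**4 at -4, -2, 0, 2, 4, evaluated at x = 3.
xs = [-4, -2, 0, 2, 4]
fs = [v**4 for v in xs]
table = aitken(xs, fs, 3)
for row in table:
    print(row)
```

The last entry of the last row is the value of the interpolating polynomial on all five points; here it equals $3^4 = 81$ because f is a polynomial of degree 4.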
Problems 


10. Use Aitken’s algorithm to obtain a value of $\sin \pi/4$ from the following
values of the function $f(x) = \sin (\pi x/2)$:

    x      -2   -1    0    1    2    3
    f(x)    0   -1    0    1    0   -1


11. Use algorithm 10.4 to determine $J_0(2.4068)$ by interpolation from the
values given in problem 3.


10.5 Neville’s Algorithm 


In Neville’s use of lemma 10.3, the polynomials $P_{k,d}$ are built up in such
a manner that each polynomial interpolates on a set of points with d + 1
consecutive indices. The algorithm is as follows:

Algorithm 10.5 For d = 0, 1, ..., n, construct the polynomials
$P_{k,d}$ as follows:

$$P_{k,0}(x) = f_k, \qquad k = 0, 1, \ldots, n; \tag{10-15}$$

$$P_{k,d+1}(x) = \frac{(x_k - x) P_{k-1,d}(x) - (x_{k-d-1} - x) P_{k,d}(x)}{x_k - x_{k-d-1}}, \qquad k = d + 1, d + 2, \ldots, n. \tag{10-16}$$

The arrangement of the polynomials $P_{k,d}$ is the same as in scheme 10.4,
but the entry $P_{k,d+1}$ is now computed from the asymmetrically
located entries $P_{k-1,d}$ and $P_{k,d}$, as shown in scheme 10.5:

     d        0           1           2         ...     n
    x_0     P_{0,0}                                               x_0 - x
    x_1     P_{1,0}     P_{1,1}                                   x_1 - x
    x_2     P_{2,0}     P_{2,1}     P_{2,2}                       x_2 - x
    ...
    x_{k-2} P_{k-2,0}   P_{k-2,1}   P_{k-2,2}                     x_{k-2} - x
    x_{k-1} P_{k-1,0}   P_{k-1,1}   P_{k-1,2}                     x_{k-1} - x
    x_k     P_{k,0}     P_{k,1}     P_{k,2}                       x_k - x
    ...
    x_n     P_{n,0}     P_{n,1}     P_{n,2}     ...   P_{n,n}     x_n - x

Scheme 10.5


Theorem 10.5 In the notation of lemma 10.3, if the polynomials
$P_{k,d}$ are generated by algorithm 10.5,

$$P_{k,d} = P_{\{x_{k-d}, x_{k-d+1}, \ldots, x_k\}},$$

d = 0, 1, ..., n; k = d, d + 1, ..., n.


Proof. By (10-15), the assertion is true for d = 0. If true for some
$d \ge 0$, then it follows from (10-16) by virtue of lemma 10.3 that $P_{k,d+1}$
interpolates on the union of the sets $\{x_{k-d-1}, x_{k-d}, \ldots, x_{k-1}\}$ and
$\{x_{k-d}, x_{k-d+1}, \ldots, x_k\}$, i.e., on the set $\{x_{k-d-1}, x_{k-d}, \ldots, x_k\}$, proving the
assertion with d increased by one.

For d = k = n we have, in particular


Corollary 10.5 $P_{n,n} = P_W$,


thus again, the rightmost polynomial in Neville’s scheme is the desired
polynomial interpolating at all points $x_0, x_1, \ldots, x_n$.

EXAMPLE

10. We again consider $f(x) = x^4$ and calculate f(3) from the values at
x = -4, -2, 0, 2, 4, using Neville’s algorithm. The following scheme
results:

    x_k   P_{k,0}   P_{k,1}   P_{k,2}   P_{k,3}   P_{k,4}   x_k - x
    -4      256                                               -7
    -2       16      -584                                     -5
     0        0       -24       396                           -3
     2       16        24        36       -24                 -1
     4      256       136       108        96        81        1
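
As with Aitken's scheme, the entries of example 10 can be generated by a short program. The Python sketch below follows (10-15) and (10-16); the function name `neville` and the row-wise storage are illustrative assumptions.

```python
def neville(xs, fs, x):
    """Triangular array P[k][d] of algorithm 10.5, evaluated at x.

    P[k][d] interpolates on the d + 1 consecutive points
    xs[k - d], ..., xs[k].
    """
    n = len(xs) - 1
    P = [[float(f)] for f in fs]                          # (10-15)
    for d in range(n):
        for k in range(d + 1, n + 1):
            num = (xs[k] - x) * P[k - 1][d] - (xs[k - d - 1] - x) * P[k][d]
            P[k].append(num / (xs[k] - xs[k - d - 1]))    # (10-16)
    return P

# Data of example 10: f(x) = x**4 at -4, -2, 0, 2, 4, evaluated at x = 3.
xs = [-4, -2, 0, 2, 4]
fs = [v**4 for v in xs]
table = neville(xs, fs, 3)
for row in table:
    print(row)
```

The first column after the function values always contains linear interpolants between neighboring points, which is why the early entries of each new row already approximate f(3) better than in Aitken's arrangement.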
Problems 


12. Which entries in the schemes of the polynomials $P_{k,d}$ generated by the
algorithms 10.4 and 10.5 are necessarily identical?


13. Find an approximate value of $\sqrt{2}$ by interpolation, using Neville’s
algorithm, from the values of the function $f(x) = 2^x$ at the points x = -2,
-1, 0, 1, 2, 3.
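14. Calculate an approximate value of the infinite series

$$1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots$$

in the following manner: Let

$$f\left(\frac{1}{n}\right) = 1 + \frac{1}{2^2} + \cdots + \frac{1}{n^2} \qquad (n = 1, 2, 3, \ldots)$$

and calculate f(0) by extrapolation from $f(1), f(\tfrac{1}{2}), f(\tfrac{1}{3}), \ldots$, using Neville’s
algorithm.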


10.6 Inverse Interpolation 


Interpolation (approximately) solves the problem of finding the value of 
y = f(x) when x is given. It does not solve the problem of finding x 
when y = f(x) is given. (We could, of course, replace f by the inter- 
polating polynomial P and solve the equation y = P(x) for x, but in doing 
so we would merely replace one problem by another problem of com- 
parable difficulty.) The problem can be easily solved, however, by 
interchanging the roles of x and y. Speaking abstractly, this amounts to 
interpolating the inverse function $f^{[-1]}$ instead of f itself. Speaking
concretely, it means interchanging the roles of the $x_k$ and the $f_k$. Since,
even for equidistant $x_k$, the corresponding values $f_k$ are not equidistant,
it is essential that we are able to calculate the interpolating polynomial for
nonequidistant interpolating points. As an example, we consider the
problem of solving f(x) = 0 when the function f is known at n + 1
distinct points $x_k$. If the polynomial P(y) interpolating the inverse
function at the points $f(x_k) = f_k$ is constructed by Aitken’s algorithm and
evaluated at y = 0, we obtain the following algorithm:


Algorithm 10.6 Let

$$X_{m,0} = x_m \qquad (m = 0, 1, \ldots, n)$$

and form for k = 0, 1, ..., m - 1 the numbers

$$X_{m,k+1} = \frac{f_m X_{k,k} - f_k X_{m,k}}{f_m - f_k}.$$

The approximate solution is given by $X_{n,n}$. The arrangement of
the triangular array of the values $X_{m,k}$ is as follows:

    f_0   X_{0,0}
    f_1   X_{1,0}   X_{1,1}
    f_2   X_{2,0}   X_{2,1}   X_{2,2}
    ...
    f_n   X_{n,0}   X_{n,1}   X_{n,2}   ...   X_{n,n}

Inverse interpolation is possible only if, in the range where interpolation 
is used, x is a single-valued function of y. In the example depicted in 
figure 10.6, where this condition is not satisfied, the interpolating poly- 
nomial bears no relationship to the inverse function. 
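
Algorithm 10.6 is Aitken's scheme with the roles of abscissas and ordinates exchanged and the evaluation point fixed at y = 0. The Python sketch below illustrates this under assumptions made here: the function name `inverse_interpolation_zero` is hypothetical, and the test function $f(x) = x^2 - 2$ on four points of [1, 1.75] (where f is monotone, so the inverse is single-valued) is chosen only as an example.

```python
def inverse_interpolation_zero(xs, fs):
    """Triangular array X[m][k] of algorithm 10.6.

    Aitken's recurrence is applied to the inverse function: the f values
    act as interpolating points, the x values as ordinates, and the
    polynomial is evaluated at y = 0.
    """
    n = len(xs) - 1
    X = [[float(x)] for x in xs]              # X_{m,0} = x_m
    for k in range(n):                        # column index
        for m in range(k + 1, n + 1):
            num = fs[m] * X[k][k] - fs[k] * X[m][k]
            X[m].append(num / (fs[m] - fs[k]))
    return X

# Illustrative data: f(x) = x**2 - 2, monotone on [1, 1.75].
xs = [1.0, 1.25, 1.5, 1.75]
fs = [x * x - 2.0 for x in xs]
X = inverse_interpolation_zero(xs, fs)
root = X[-1][-1]                              # X_{n,n}
print(root)
```

The result approximates $\sqrt{2}$; how closely depends on the derivatives of the inverse function, as discussed in the text that follows.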

The error of inverse interpolation obeys the same laws as the error of
ordinary interpolation. It depends on the derivatives of the inverse
function $f^{[-1]}(y)$. These derivatives can be calculated, in principle at
least, from the derivatives of the function f. Differentiating the identity

$$f^{[-1]}(f(x)) = x$$

we obtain

$$f^{[-1]\prime}(f(x))\, f'(x) = 1; \tag{10-17}$$

hence

$$f^{[-1]\prime}(f(x)) = \frac{1}{f'(x)}.$$

Higher derivatives can be obtained by repeatedly differentiating (10-17).
For instance,

$$f^{[-1]\prime\prime}(f(x))\, [f'(x)]^2 + f^{[-1]\prime}(f(x))\, f''(x) = 0$$

shows that

$$f^{[-1]\prime\prime}(f(x)) = -\frac{f''(x)}{[f'(x)]^3}.$$

Figure 10.6


This process can be continued, but the results become more and more
complicated, even if the derivatives of f are simple.
Problem 


15. Using inverse interpolation, find an approximate value of the second zero
of the Bessel function $J_0(x)$ if the following values are given:

    x       J_0(x)
    5.2    -0.1102904
    5.4    -0.0412101
    5.6     0.0269709
    5.8     0.0917026


10.7 Iterated Inverse Interpolation 


Solving an equation f(x) = 0 by simple inverse interpolation as
described in example 11 is appropriate if the function f is known only at a
set of discrete points x (e.g., if f is a tabulated function). If f(x) can be
calculated for arbitrary x, it is possible to test the result of the inverse
interpolation procedure by evaluating f at the interpolated value of x.
In general, f(x) will not be exactly equal to 0. In this case, f(x) and x are
introduced as new entries in the interpolation table, and a new row of
values of X is calculated by inverse interpolation. The Neville form of
the interpolation table is especially appropriate here, because in it
the first entries in a new horizontal row are already good approximations to the
desired value of x. Beginning with two values of f, the Neville scheme is
continued systematically row by row as follows:


Algorithm 10.7 Choose $x_0$ and $x_1$, and let

$$\begin{aligned}
f_0 &= f(x_0), & X_{0,0} &= x_0, \\
f_1 &= f(x_1), & X_{1,0} &= x_1, \\
& & X_{1,1} &= \frac{f_1 X_{0,0} - f_0 X_{1,0}}{f_1 - f_0}.
\end{aligned} \tag{10-18}$$

Then form the triangular array of numbers $X_{m,n}$ (m = 2, 3, ...;
n = 0, 1, ..., m) by means of the relations

$$X_{m,0} = X_{m-1,m-1}, \qquad f_m = f(X_{m,0}),$$

$$X_{m,n+1} = \frac{f_m X_{m-1,n} - f_{m-n-1} X_{m,n}}{f_m - f_{m-n-1}}, \qquad n = 0, 1, \ldots, m - 1. \tag{10-19}$$

If f is sufficiently differentiable in a neighborhood of a solution s of
f(x) = 0, if $f'(s) \ne 0$, and if $x_0$ and $x_1$ are sufficiently close to s, it is
intuitively clear that the numbers $X_{n,n}$ converge to s for $n \to \infty$.

As described above, the table of the values $X_{m,n}$ extends farther and
farther to the right with every new row. Ultimately, little extra accuracy
will be gained from the entries far to the right, because they depend on
values $X_{n,0}$ with small n which presumably are poor approximations to the
desired solution. It is therefore advisable not to increase the degree of
the inverse interpolating polynomial beyond a certain degree d (say
d = 2 or d = 3), that is, to truncate the table of values $X_{m,n}$ after the dth
column. This means that the formulas (10-19) are only used for $m \le d$;
for m > d they are to be replaced by the following:

$$X_{m,0} = X_{m-1,d}, \qquad f_m = f(X_{m,0}),$$

$$X_{m,n+1} = \frac{f_m X_{m-1,n} - f_{m-n-1} X_{m,n}}{f_m - f_{m-n-1}}, \qquad n = 0, 1, \ldots, d - 1. \tag{10-20}$$

The convergence of this modified version of algorithm 10.7 is, under
suitable conditions, proved by Ostrowski ([1960], chapter 13).

The case d = 2 of the modified algorithm 10.7 is very similar to Muller’s
method discussed in §10.1, except that now the inverse function rather
than the function itself is interpolated by a quadratic polynomial. From
the computational point of view it would even seem to be superior to
Muller’s method since it does not require the evaluation of square roots.
However, just for this reason it lacks the advantage of automatically
branching off into the complex domain if no real zeros are found.
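
The truncated form of algorithm 10.7 can be sketched in Python as follows. This is a minimal illustration, not a robust root finder: the function name, the stopping tolerance `tol`, the row limit `max_rows`, and the guard against a vanishing denominator are all assumptions added here, and the test equation $x^2 - 2 = 0$ is chosen only as an example.

```python
def iterated_inverse_interpolation(f, x0, x1, d=2, max_rows=20, tol=1e-12):
    """Modified algorithm 10.7: Neville rows truncated after column d.

    fs[m] holds f_m = f(X_{m,0}); prev is row m - 1 of the table.
    tol and max_rows are illustrative stopping parameters.
    """
    fs = [f(x0), f(x1)]
    prev, current_x = [x0], x1
    for m in range(1, max_rows + 1):
        row = [current_x]                      # X_{m,0}
        for n in range(min(m, d, len(prev))):
            denom = fs[m] - fs[m - n - 1]
            if denom == 0.0:                   # guard added for this sketch
                break
            num = fs[m] * prev[n] - fs[m - n - 1] * row[n]
            row.append(num / denom)            # X_{m,n+1}
        x_next = row[-1]                       # becomes X_{m+1,0}
        f_next = f(x_next)
        if abs(f_next) <= tol or abs(x_next - current_x) <= tol:
            return x_next
        fs.append(f_next)
        prev, current_x = row, x_next
    return current_x

root = iterated_inverse_interpolation(lambda x: x * x - 2.0, 1.0, 2.0)
print(root)
```

With d = 1 the same loop reduces to repeated linear inverse interpolation; with d = 2 each new row uses a quadratic interpolant of the inverse function through the three most recent points.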


Problems 


16. Using repeated quadratic inverse interpolation, find the root of the
polynomial

$$P(x) = 70x^4 - 140x^3 + 90x^2 - 20x + 1$$

located between 0.6 and 0.7. (Use Horner’s scheme to evaluate the
polynomial.)

17. Show that iterated linear inverse interpolation is identical with the regula
falsi (see §4.11).


Recommended Reading 


The practical aspects of interpolation are dealt with in a volume issued 
by the Nautical Almanac Service [1956]. The theory of a number of 
processes for solving f(x) = 0 by methods based on interpolation is dealt 
with very thoroughly by Ostrowski [1960]. 


Research Problems 


1. How can the regula falsi be extended to the solution of systems of 
more than one equation? (For some pertinent remarks, see Ostrowski 
[1960], p. 146.) 

2. Develop a theory for interpolating functions of two variables by 
bilinear polynomials of the form a + bx + cy + dxy. 


chapter 11 construction of the interpolating
polynomial: methods using differences


The representations of the interpolating polynomial discussed in chapter 10 
were based directly on the values of the interpolated function. They do 
not convey any information, explicit or implicit, concerning the error of 
the interpolating polynomial. In this respect, the methods to be discussed 
in the present chapter do somewhat better. These representations are 
based on differences of the sequence of function values, and on certain 
properties of binomial coefficients. 


11.1. Differences and Binomial Coefficients 


Differences of a sequence of numbers were already defined in §4.4 and
§6.9. We now introduce differences of a function f defined on a suitable
interval. Let h > 0 be a constant. The function $\Delta f$ whose value at x is
given by

$$\Delta f(x) = f(x + h) - f(x)$$

is called the first (forward) difference of the function f. It obviously
depends on the step h, although this fact is usually not made evident in the
notation. Higher differences are defined inductively by the relation

$$\Delta^k f = \Delta(\Delta^{k-1} f), \qquad k = 2, 3, \ldots.$$

For instance,

$$\Delta^2 f(x) = f(x + 2h) - 2f(x + h) + f(x).$$

For symmetry we put

$$\Delta^0 f = f.$$

An induction argument entirely analogous to that employed in the proof
of (6-27) shows that

$$\Delta^k f(x) = f(x + kh) - \binom{k}{1} f(x + (k - 1)h) + \binom{k}{2} f(x + (k - 2)h) + \cdots + (-1)^k f(x).$$

If, for integral k, $x_k = x_0 + kh$, and if we write

$$f(x_k) = f_k,$$

the differences thus introduced produce the same result as the difference
operator $\Delta$ introduced in §4.4, if the latter is applied to the sequence of
values $\{f_k\}$ of f.

It is to be expected that the differences of a function share many
properties, and have many connections, with the derivatives of the
function. For instance, the mean value theorem of differential calculus states
that, for some $\xi$ between x and x + h,

$$\Delta f(x) = h f'(\xi).$$

We shall soon become acquainted with a generalization of this relation to
differences and derivatives of arbitrary order.

In differential calculus, a set of functions enjoying particularly simple
properties with respect to differentiation is the set of monomials $x^n/n!$,
n = 0, 1, .... In fact,

$$\frac{d}{dx} \frac{x^n}{n!} = \frac{x^{n-1}}{(n-1)!}. \tag{11-1}$$

In difference calculus, an analogous role is played by the binomial
coefficients

$$\binom{s}{n} = \frac{s(s - 1)(s - 2) \cdots (s - n + 1)}{n!}. \tag{11-2}$$

Here s is any real (or even complex) number, and n is a positive integer.
For n = 0, the symbol (11-2) is defined to be 1, for negative integers n,
zero. It is always understood in the following that the operator $\Delta$ acts on
the variable s, and that the step h = 1 is implied. With this understanding
we have

$$\Delta \binom{s}{n} = \binom{s}{n-1}. \tag{11-3}$$

This is trivially true for $n \le 0$; for n > 0 we have

$$\begin{aligned}
\Delta \binom{s}{n} &= \binom{s+1}{n} - \binom{s}{n} \\
&= \frac{(s + 1)s(s - 1) \cdots (s - n + 2) - s(s - 1) \cdots (s - n + 1)}{n!} \\
&= \frac{s(s - 1) \cdots (s - n + 2)\,[(s + 1) - (s - n + 1)]}{n(n - 1)!} \\
&= \binom{s}{n-1},
\end{aligned}$$

as desired. By induction it follows immediately from (11-3) that

$$\Delta^k \binom{s}{n} = \binom{s}{n-k}, \qquad k = 0, 1, 2, \ldots. \tag{11-4}$$
Another property of the monomials $x^n/n!$ also carries over to the
binomial coefficients. It is trivial that every polynomial of degree n can
be expressed as a linear combination of the monomials

$$1, \; \frac{x}{1!}, \; \frac{x^2}{2!}, \; \ldots, \; \frac{x^n}{n!}.$$

Similarly, such a polynomial can also be expressed as a linear combination
of the binomial coefficients

$$\binom{s}{0}, \; \binom{s}{1}, \; \ldots, \; \binom{s}{n}.$$

In fact, a somewhat more general statement is true. If $\{a_1, a_2, \ldots, a_n\}$ is an
arbitrary set of real numbers, then a polynomial of degree n is expressible
in terms of the generalized monomials

$$1, \; \frac{x + a_1}{1!}, \; \frac{(x + a_2)^2}{2!}, \; \ldots, \; \frac{(x + a_n)^n}{n!}.$$

This fact has the following analog:


Theorem 11.1 Let $a_1, a_2, \ldots, a_n$ be arbitrary real numbers, and let
p be any polynomial of degree n. Then there exist constants $A_0,
A_1, \ldots, A_n$ such that

$$p(s) = A_0 + A_1 \binom{s + a_1}{1} + A_2 \binom{s + a_2}{2} + \cdots + A_n \binom{s + a_n}{n} \tag{11-5}$$

identically in s.


Proof. Evidently the statement of the theorem is true for n = 0; we
proceed by induction with respect to n and assume that the theorem is
true for some nonnegative integer n - 1. If

$$p(s) = b_0 s^n + b_1 s^{n-1} + \cdots + b_n,$$

then the polynomial

$$q(s) = p(s) - b_0\, n! \binom{s + a_n}{n} \tag{11-6}$$

is of degree n - 1, since the leading coefficient of $\binom{s + a_n}{n}$ is $1/n!$.
By the induction hypothesis, q can be represented in the form

$$q(s) = A_0 + A_1 \binom{s + a_1}{1} + \cdots + A_{n-1} \binom{s + a_{n-1}}{n-1}.$$

Solving (11-6) for p(s) we obtain a representation of the desired form
(11-5), where $A_n = b_0\, n!$.
EXAMPLES 


1. A special case of theorem 11.1 was used in example 13 of chapter 6, 
where we obtained the formula 


past yesG) = G) 


2. The truth of corollary 6.8 (which was not proved in §6.8) follows from
the special case $a_1 = a_2 = \cdots = a_n = 0$ of the above theorem 11.1.


Problems 


1. Determine all functions f that are defined on the whole real line and satisfy

$$\Delta f(x) = h c f(x)$$

identically in x, where c is a constant.

2. If f and g are two differentiable functions, find a formula for $\Delta(fg)$ and
derive the product rule for differentiation.

3. Formulate an algorithm for obtaining the representation (11-5) for a
given polynomial $p(s) = b_0 s^n + \cdots + b_n$ and for given constants
$a_1, a_2, \ldots, a_n$. (Determine $A_n$ first.)

4. Represent the polynomials p(s) = s" (n = 1,2,...) in the form (11-5), 
where a; = 15. 2504 < 

5. Find a closed expression for the differences

$$\Delta^k f_0, \qquad k = 0, 1, 2, \ldots,$$

if $f(x) = e^x$ and $x_0 = 0$. Show that in this case

$$\lim_{h \to 0} h^{-k} \Delta^k f_0 = f^{(k)}(0).$$


11.2 Finalized Representations of Sequences of Interpolating Polynomials


We now return to the problem of constructing the polynomials
interpolating a function f on a given set of points. The interpolating points
are a set of equidistant points

$$x_k = x_0 + kh,$$

where the integer k may be positive or negative. As usual we write

$$f_k = f(x_k).$$

It will be convenient to express the interpolating polynomial in terms of
the variable

$$s = \frac{x - x_0}{h}. \tag{11-7}$$

Thus, if S is a set of interpolating points $x_k$ and $P_S$ denotes the polynomial
interpolating f at the points of S, we shall define p by

$$p(s) = P_S(x) = P_S(x_0 + sh).$$

The polynomial p is characterized by the property that

$$p(k) = f_k \quad \text{whenever } x_k \in S;$$

for nonintegral values of s, p(s) is to be regarded as an approximation to
$f(x_0 + sh)$.

Actually, we are now not merely interested in constructing a single
polynomial, interpolating f on a single set S. Instead, we shall try to
determine sequences of interpolating polynomials that interpolate f on a
sequence of sets $S_0, S_1, S_2, \ldots$ of interpolating points. These sets are
defined as follows:

Let $m_0, m_1, m_2, \ldots$ be a nonincreasing sequence of integers such that

$$m_k - 1 \le m_{k+1} \le m_k \tag{11-8}$$

for all k = 0, 1, 2, ..., and let

$$S_k = \{x_{m_k}, x_{m_k+1}, \ldots, x_{m_k+k}\},$$

k = 0, 1, 2, .... The set $S_k$ thus contains precisely k + 1 consecutive
interpolating points, beginning with the point $x_{m_k}$. By virtue of (11-8),
each set $S_k$ contains the preceding set $S_{k-1}$ and thus all preceding sets
$S_{k-2}, \ldots, S_0$.


EXAMPLES

3. Let $m_k = 0$, k = 0, 1, 2, .... Then

$$S_k = \{x_0, x_1, \ldots, x_k\}.$$

4. If $m_k = -k$, k = 0, 1, 2, ..., we have

$$S_k = \{x_{-k}, x_{-k+1}, \ldots, x_0\}.$$

5. Setting $m_0 = m_1 = 0$, $m_2 = m_3 = -1$, ..., we obtain the sets

$$S_{2k} = \{x_{-k}, x_{-k+1}, \ldots, x_k\}, \qquad S_{2k+1} = \{x_{-k}, x_{-k+1}, \ldots, x_{k+1}\}, \qquad k = 0, 1, 2, \ldots.$$

6. Setting $m_0 = 0$, $m_1 = m_2 = -1$, $m_3 = m_4 = -2$, ..., the sets $S_k$ are
given by

$$S_{2k} = \{x_{-k}, x_{-k+1}, \ldots, x_k\}, \qquad S_{2k+1} = \{x_{-k-1}, x_{-k}, \ldots, x_k\}.$$

By the fundamental theorem 9.1, there exists, for every k = 0, 1, 2, ...,
a unique polynomial $P_{S_k}$ of degree $\le k$ such that

$$P_{S_k}(x_m) = f_m, \qquad x_m \in S_k.$$

If x and s are connected by (11-7), we shall write

$$p_k(s) = P_{S_k}(x).$$

We now wish to consider the following problem: Given a sequence of
integers $\{m_k\}$ satisfying (11-8) (and hence a sequence of sets $S_0, S_1, \ldots$),
determine two sequences of real numbers $\{a_k\}$ and $\{A_k\}$ such that, for every
n = 0, 1, 2, ...,

$$p_n(s) = A_0 + A_1 \binom{s + a_1}{1} + \cdots + A_n \binom{s + a_n}{n} \tag{11-9}$$

identically in s.

It is not clear at all that this problem has a solution. Theorem 11.1
merely tells us that, for an arbitrary sequence $\{a_k\}$ and for every fixed
integer n, constants $A_0, A_1, \ldots, A_n$ can be found such that (11-9) holds.
It is to be expected, however, that if n is replaced by n + 1, the constants
$A_0, A_1, \ldots, A_n$ already found will have to be replaced by other constants.
If we wish to obtain a “finalized” representation of the sequence $\{p_k\}$
with the property that the constants $A_k$, once determined, remain
unchanged, we can hope to do so only by a judicious choice of the sequence
$\{a_k\}$.

EXAMPLE

7. Let $a_n = n$, n = 1, 2, .... The function f(s) = s is interpolated at
s = 0 by $p_0(s) = 0$. This is a representation of the form (11-9) with
$A_0 = 0$. In order to interpolate f at s = 0 and s = 1, we must take

$$p_1(s) = s = -1 + \binom{s + 1}{1}.$$

The last expression is again of the form (11-9), but we now have $A_0 = -1$.
Since $s = A_0 + (s + a_1)$ forces $A_0 = -a_1$, the coefficient $A_0 = 0$ is
preserved if and only if we choose $a_1 = 0$.

Let us now investigate the properties which the sequence $\{a_k\}$ must have
if the above problem is to have a solution. For an arbitrary integer
$n \ge 0$, consider the two polynomials

$$p_n(s) = A_0 + A_1 \binom{s + a_1}{1} + \cdots + A_n \binom{s + a_n}{n}$$

and

$$p_{n+1}(s) = A_0 + A_1 \binom{s + a_1}{1} + \cdots + A_n \binom{s + a_n}{n} + A_{n+1} \binom{s + a_{n+1}}{n+1}.$$

The polynomial $p_n$ interpolates on the set $S_n$, $p_{n+1}$ on the set $S_{n+1}$. Since
$S_n$ is contained in $S_{n+1}$, both polynomials interpolate on the set $S_n$.
Both thus have identical values for $x_k$ in the set $S_n$. This means that the
last term

$$A_{n+1} \binom{s + a_{n+1}}{n+1}$$

must vanish whenever s is equal to one of the integers

$$m_n, \; m_n + 1, \; \ldots, \; m_n + n. \tag{11-10}$$

If f is such that $p_{n+1}$ has degree n + 1, then $A_{n+1} \ne 0$, and the required
condition is

$$\binom{s + a_{n+1}}{n+1} = 0 \tag{11-11}$$

for all said integers. The binomial coefficient in (11-11) is zero if and only
if s is one of the numbers

$$-a_{n+1}, \; -a_{n+1} + 1, \; \ldots, \; -a_{n+1} + n.$$

Evidently the set of these numbers coincides with the set (11-10) if and
only if

$$a_{n+1} = -m_n, \qquad n = 0, 1, 2, \ldots. \tag{11-12}$$

This condition fully determines the sequence $\{a_1, a_2, \ldots\}$ as a function of
the sequence $\{m_k\}$.

There remains the problem of determining the constants $A_k$. For a
fixed value of n, there certainly exist, by theorem 11.1, constants $A_0,
A_1, \ldots, A_n$ such that

$$p_n(s) = A_0 + A_1 \binom{s - m_0}{1} + \cdots + A_n \binom{s - m_{n-1}}{n}.$$

The question is whether these $A_k$ are independent of n. In order to
determine $A_k$ for $0 \le k \le n$, we form the kth difference of $p_n(s)$. By
(11-4), we get

$$\Delta^k p_n(s) = A_k + A_{k+1} \binom{s - m_k}{1} + \cdots + A_n \binom{s - m_{n-1}}{n-k}. \tag{11-13}$$

In this identity we set $s = m_k$. The values of $p_n$ involved in forming the
difference $\Delta^k p_n(s)$ then are the values $p_n(s)$ for $s = m_k, m_k + 1, \ldots, m_k + k$,
i.e., those values of s for which $x_s \in S_k$. Since $S_k \subset S_n$, we have

$$p_n(s) = f_s$$

for these values, and hence

$$\Delta^k p_n(m_k) = \Delta^k f_{m_k}.$$

In the expression on the right of (11-13), all binomial coefficients are zero
for $s = m_k$, as it follows from (11-8) that

$$0 \le m_k - m_{k+l-1} \le l - 1, \qquad l = 1, 2, \ldots.$$

We thus obtain

$$A_k = \Delta^k f_{m_k},$$

and it turns out that $A_k$ is independent of n, the degree of the interpolating
polynomial, as we had hoped. The polynomials $p_n$ solving the problem
posed initially thus are given by

$$p_n(s) = \sum_{k=0}^{n} \Delta^k f_{m_k} \binom{s - m_{k-1}}{k}. \tag{11-14}$$

They can be generated recursively by the following simple algorithm:


Algorithm 11.2 If $\{m_k\}$ is a sequence of integers satisfying (11-8),
let $p_0(s) = f_{m_0}$, and for n = 0, 1, 2, ...,

$$p_{n+1}(s) = p_n(s) + \Delta^{n+1} f_{m_{n+1}} \binom{s - m_n}{n+1}. \tag{11-15}$$

By construction, we have

$$p_n(k) = f(x_k),$$

$k = m_n, m_n + 1, \ldots, m_n + n$. We wish to find an expression for the
error of $p_n(s)$ if s is not equal to one of the above values of k. This is
easily possible in our new notation. According to theorem 9.2 the
difference

$$f(x) - P_{S_n}(x) = f(x) - p_n(s)$$

can, if f has a continuous derivative of order n + 1, be written in the
form

$$\frac{(x - x_{m_n})(x - x_{m_n+1}) \cdots (x - x_{m_n+n})}{(n + 1)!}\, f^{(n+1)}(\xi_s),$$

where $\xi_s$ is some point in the smallest interval containing x, $x_{m_n}$, and
$x_{m_n+n}$. We have

$$x - x_{m_n+k} = h[s - (m_n + k)], \qquad k = 0, 1, \ldots, n.$$

The product of the factors $x - x_{m_n+k}$ appearing above thus can be
written

$$(n + 1)!\, h^{n+1} \binom{s - m_n}{n+1},$$

and we obtain


Theorem 11.2 Let the function f have a continuous (n + 1)st
derivative on an interval containing the points $x = x_0 + sh$, $x_{m_n}$, and
$x_{m_n+n}$. If $p_n(s)$ is defined by algorithm 11.2, then for a suitable
point $\xi_s$ of that interval

$$f(x) = p_n(s) + h^{n+1} f^{(n+1)}(\xi_s) \binom{s - m_n}{n+1}. \tag{11-16}$$


A comparison of the equations (11-15) and (11-16) shows that the
correction term which has to be added to $p_n(s)$ in order to obtain the exact
value of f is of the same form as the term that has to be added in order to
pass from $p_n$ to $p_{n+1}$, with the exception that

$$\Delta^{n+1} f_{m_{n+1}} \quad \text{is to be replaced by} \quad h^{n+1} f^{(n+1)}(\xi_s).$$

If the function $f^{(n+1)}$ does not change very rapidly, these two terms are of
the same order of magnitude (see the problems 11 and 12). One can thus
say with some justification that the error of $p_n$ is of the order of the first
omitted term in the sum (11-14). Thus, if the degree of the interpolating
polynomial is not fixed beforehand, one may hope to obtain an accurate
representation of f (to within rounding errors) by extending the sum
(11-14) through such a value of n that the omitted terms are insignificant.


EXAMPLES

We shall construct the sequences of interpolating polynomials
corresponding to the sequences $\{m_n\}$ considered in the examples 3, 4, 5, and 6.

8. For $m_n = 0$, n = 0, 1, 2, ... we obtain the polynomials

$$p_n(s) = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_0 + \cdots + \binom{s}{n} \Delta^n f_0$$

interpolating on the sets $\{x_0, x_1, \ldots, x_n\}$.

9. For $m_n = -n$, n = 0, 1, 2, ... we obtain

$$p_n(s) = f_0 + \binom{s}{1} \Delta f_{-1} + \binom{s+1}{2} \Delta^2 f_{-2} + \cdots + \binom{s+n-1}{n} \Delta^n f_{-n}.$$

These polynomials $p_n$ interpolate on the set $\{x_0, x_{-1}, x_{-2}, \ldots, x_{-n}\}$.
The formulas obtained in examples 8 and 9 are known, respectively, as
the Newton forward and the Newton backward formula.

10. Letting

$$\{m_0, m_1, m_2, \ldots\} = \{0, 0, -1, -1, \ldots\},$$

we get the polynomials

$$p_n(s) = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_{-1} + \binom{s+1}{3} \Delta^3 f_{-1} + \binom{s+1}{4} \Delta^4 f_{-2} + \cdots \qquad (n + 1 \text{ terms})$$

interpolating on the sets

$$\{x_{-k}, x_{-k+1}, \ldots, x_k\} \quad \text{for } n = 2k$$

and

$$\{x_{-k}, x_{-k+1}, \ldots, x_{k+1}\} \quad \text{for } n = 2k + 1.$$

11. Taking $\{m_0, m_1, m_2, \ldots\} = \{0, -1, -1, -2, -2, \ldots\}$, we obtain
the polynomials

$$p_n(s) = f_0 + \binom{s}{1} \Delta f_{-1} + \binom{s+1}{2} \Delta^2 f_{-1} + \binom{s+1}{3} \Delta^3 f_{-2} + \binom{s+2}{4} \Delta^4 f_{-2} + \cdots \qquad (n + 1 \text{ terms}).$$

They interpolate on the sets $\{x_{-k}, x_{-k+1}, \ldots, x_k\}$ for n = 2k and
$\{x_{-k-1}, x_{-k}, \ldots, x_k\}$ for n = 2k + 1.

The formulas obtained in the examples 10 and 11 are known, respectively,
as the Gauss forward and the Gauss backward formula.
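
All four formulas are instances of the recurrence (11-15) and differ only in the sequence $\{m_n\}$. The Python sketch below makes this explicit; the helper names, the dictionary `f_vals` mapping an integer k to $f_k$, and the sample function $f(x) = x^3$ with $x_0 = 0$, h = 1 are assumptions made here for illustration.

```python
from math import comb

def binom(s, n):
    """Binomial coefficient (s over n) for real s, as in (11-2)."""
    if n < 0:
        return 0.0
    value = 1.0
    for j in range(n):
        value *= (s - j) / (j + 1)
    return value

def forward_difference(f_vals, order, index):
    """Delta^order f_index, where f_vals maps an integer k to f_k."""
    return sum((-1) ** (order - j) * comb(order, j) * f_vals[index + j]
               for j in range(order + 1))

def interpolate(f_vals, m, s):
    """p_n(s) from algorithm 11.2, with n = len(m) - 1 and m = [m_0, ..., m_n]."""
    p = f_vals[m[0]]
    for n in range(len(m) - 1):
        p += forward_difference(f_vals, n + 1, m[n + 1]) * binom(s - m[n], n + 1)
    return p

# Illustrative data: f(x) = x**3 with x_0 = 0 and h = 1, so f_k = k**3.
f_vals = {k: float(k**3) for k in range(-3, 4)}
sequences = {
    "Newton forward": [0, 0, 0, 0],
    "Newton backward": [0, -1, -2, -3],
    "Gauss forward": [0, 0, -1, -1],
    "Gauss backward": [0, -1, -1, -2],
}
results = {name: interpolate(f_vals, m, 0.5) for name, m in sequences.items()}
print(results)
```

Because f is a cubic and each formula here uses four points, all four values agree with $f(0.5) = 0.125$ up to rounding, although they are built from different sets of interpolating points.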


Problems 


6. Forming differences of the values of the function $J_0$ given in problem 15,
chapter 10, find $J_0(5.5)$ by the four interpolation formulas given above.


7. Using the fact that

$$\binom{-s}{k} = (-1)^k \binom{s + k - 1}{k}$$

and expressing forward differences by backward differences (see §6.9),
show that the Newton backward formula can be written in the form

$$p_n(s) = f_0 - \binom{-s}{1} \nabla f_0 + \binom{-s}{2} \nabla^2 f_0 - \cdots + (-1)^n \binom{-s}{n} \nabla^n f_0.$$

8. Expressing ordinates in terms of differences. Establish the formula

$$f_k = \sum_{m=0}^{k} \binom{k}{m} \Delta^m f_0, \qquad k = 0, 1, 2, \ldots$$

(a) by induction; (b) by considering it as a special case of the Newton
formula.
9. Using the fact that the identity of problem 8 must hold for arbitrary values
of $f_0, f_1, \ldots$, show that for arbitrary integers n and k such that $0 \le n \le k$

$$\sum_{m=n}^{k} (-1)^{m-n} \binom{k}{m} \binom{m}{n} = \delta_{nk}.$$

10. Study the convergence of the infinite series

$$\sum_{k=0}^{\infty} \binom{s}{k} \Delta^k f_0$$

(Newton’s formula extended to infinitely many terms), where $f(x) = e^x$, as
it depends on x and h. (Use problem 5 and apply the ratio test.)

11. Let f be n times continuously differentiable on a suitable interval. Show
that for some $\xi \in (x_0, x_n)$

$$\Delta^n f_0 = h^n f^{(n)}(\xi).$$

[Differentiate the function

$$f(x) - \sum_{k=0}^{n} \binom{s}{k} \Delta^k f_0, \qquad s = \frac{x - x_0}{h},$$

n times with respect to x and apply Rolle’s theorem.]

12. As an application of the preceding problem, show that

$$\lim_{h \to 0} h^{-n} \Delta^n f_0 = f^{(n)}(x_0)$$

for any sufficiently differentiable function f.

13. Assume that the values of $f_n$ are known only up to rounding errors $\varepsilon_n$,
where $|\varepsilon_n| \le \varepsilon$. Show that the maximum error in $\Delta^k f_n$ can be as large
as $2^k \varepsilon$.


11.3. Some Special Interpolation Formulas 


In spite of their basic simplicity the interpolation formulas of Newton
and Gauss given in §11.2 are not frequently used in practice, mainly
because of their lack of formal symmetry. More frequently used in
practical interpolation are certain formulas named after Stirling, Bessel,
and Everett.


The elegant formulation of these formulas requires the introduction of
two new operators $\mu$ and $\delta$ in addition to the forward difference operator
$\Delta$. The operator $\mu$ is defined by

$$\mu f(x) = \frac{1}{2} \left[ f\left(x + \frac{h}{2}\right) + f\left(x - \frac{h}{2}\right) \right] \tag{11-17}$$

and consequently is called the averaging operator. The operator $\delta$ is
defined by

$$\delta f(x) = f\left(x + \frac{h}{2}\right) - f\left(x - \frac{h}{2}\right) \tag{11-18}$$

and is called the central difference operator. Note that $\delta$ can always be
expressed in terms of $\Delta$, and vice versa. For instance,

$$\delta^{2k} f_0 = \Delta^{2k} f_{-k}, \qquad k = 0, 1, 2, \ldots.$$

A further identity to be noted is

$$\mu \delta f_0 = \frac{1}{2} (\Delta f_{-1} + \Delta f_0) = \frac{1}{2} (f_1 - f_{-1}).$$

Stirling’s formula. For a given even integer n = 2k both the Gauss
forward and the Gauss backward formula yield the interpolating
polynomial corresponding to the set $\{x_{-k}, x_{-k+1}, \ldots, x_k\}$. Thus also their
arithmetic mean must yield the same polynomial. The resulting
polynomial

$$\begin{aligned}
p_{2k}(s) = f_0 &+ \binom{s}{1} \frac{1}{2} \left[ \Delta f_0 + \Delta f_{-1} \right] + \frac{1}{2} \left[ \binom{s}{2} + \binom{s+1}{2} \right] \Delta^2 f_{-1} + \cdots \\
&+ \binom{s+k-1}{2k-1} \frac{1}{2} \left[ \Delta^{2k-1} f_{-k+1} + \Delta^{2k-1} f_{-k} \right] + \frac{1}{2} \left[ \binom{s+k-1}{2k} + \binom{s+k}{2k} \right] \Delta^{2k} f_{-k}
\end{aligned}$$

can by virtue of the identity

$$\binom{s+k-1}{2k} + \binom{s+k}{2k} = \frac{s}{k} \binom{s+k-1}{2k-1}$$

be written in the form

$$p_{2k}(s) = f_0 + \binom{s}{1} \left[ \mu \delta f_0 + \frac{s}{2} \delta^2 f_0 \right] + \cdots + \binom{s+k-1}{2k-1} \left[ \mu \delta^{2k-1} f_0 + \frac{s}{2k} \delta^{2k} f_0 \right]. \tag{11-19}$$

This formula, called Stirling’s formula, expresses the polynomial
interpolating at an odd number of equidistant points in terms of central
differences at the center point. It is preferably used for interpolation
near that center point.

Bessel’s formula. If $n$ is odd, $n = 2k + 1$, the Gaussian forward
formula at $x_0$ and the Gaussian backward formula at $x_1$ interpolate on the
same set of points $S_{2k+1} = \{x_{-k}, x_{-k+1}, \ldots, x_{k+1}\}$. The Gaussian
backward formula centered at $x_1$ is given by

$$p_{2k+1}(s) = f_1 + \binom{s'}{1}\Delta f_0 + \binom{s'+1}{2}\Delta^2 f_0 + \cdots + \binom{s'+k}{2k+1}\Delta^{2k+1} f_{-k},$$

where

$$s' = \frac{x - x_1}{h} = s - 1.$$

Averaging this expression with the Gaussian forward formula at $x_0$ we
get, after some simplification,

(11-20)    $p_{2k+1}(s) = \mu f_{1/2} + \left(s - \tfrac{1}{2}\right)\delta f_{1/2} + \binom{s}{2}\left[\mu\delta^2 f_{1/2} + \tfrac{1}{3}\left(s - \tfrac{1}{2}\right)\delta^3 f_{1/2}\right] + \cdots$
$\qquad\qquad + \binom{s+k-1}{2k}\left[\mu\delta^{2k} f_{1/2} + \tfrac{1}{2k+1}\left(s - \tfrac{1}{2}\right)\delta^{2k+1} f_{1/2}\right].$

This formula is known as Bessel’s formula. It expresses the interpolating
polynomial for an even number of consecutive points in terms of central
differences. It is preferably used for interpolation halfway between the
two center points.

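Everett’s formula. We start from Gauss’ forward formula, where $n$ is
odd, $n = 2k + 1$. Eliminating differences of odd order by use of the
formula

$$\Delta^{2k+1} f_{-k} = \Delta^{2k} f_{-k+1} - \Delta^{2k} f_{-k},$$

we obtain

$$p_{2k+1}(s) = f_0 + \binom{s}{1}(f_1 - f_0) + \binom{s}{2}\Delta^2 f_{-1} + \binom{s+1}{3}\left(\Delta^2 f_0 - \Delta^2 f_{-1}\right) + \cdots$$
$$\qquad + \binom{s+k-1}{2k}\Delta^{2k} f_{-k} + \binom{s+k}{2k+1}\left(\Delta^{2k} f_{-k+1} - \Delta^{2k} f_{-k}\right).$$

Collecting equal differences, expressing them in terms of the central
difference operator and using the identity

$$\binom{s+k-1}{2k} - \binom{s+k}{2k+1} = \binom{t+k}{2k+1},$$

where

$$t = 1 - s,$$

we obtain the formula

(11-21)    $p_{2k+1}(s) = \binom{t}{1} f_0 + \binom{t+1}{3}\delta^2 f_0 + \cdots + \binom{t+k}{2k+1}\delta^{2k} f_0$
$\qquad\qquad + \binom{s}{1} f_1 + \binom{s+1}{3}\delta^2 f_1 + \cdots + \binom{s+k}{2k+1}\delta^{2k} f_1$

due to the British astronomer Everett.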

Of the three formulas given above, the highly symmetrical and elegant 
formula due to Everett has found great favor in practice. It has the 
advantage of using only differences of even order, and of furnishing an 
interpolation polynomial whose degree (and consequently, accuracy) is 
higher than the order of the highest difference employed. For instance, 
with a column of second differences alone we can calculate the cubic 
interpolating polynomial, whereas the application of all other formulas 
requires three difference columns. For these reasons many tables of 
higher transcendental functions, if they give any differences at all, give 
second differences only. No special tables of the Everett interpolation
coefficients

$$\binom{s+k}{2k+1}$$

are required, as these coefficients are identical with the extreme (first
and last) Lagrangian interpolation coefficients interpolating on the set
$x_{-k}, x_{-k+1}, \ldots, x_{k+1}$ (see §10.2).
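
As an illustration of (11-21) with second differences only ($k = 1$), the following Python sketch evaluates the cubic interpolating polynomial from two ordinates and two second differences. The helper names `binom` and `everett_cubic`, and the sample table of $\sin x$ with step 0.1, are assumptions made for this example and do not come from the text.

```python
from math import sin

def binom(z, m):
    """Generalized binomial coefficient C(z, m) for real z and integer m >= 0."""
    result = 1.0
    for i in range(m):
        result *= (z - i) / (i + 1)
    return result

def everett_cubic(f0, f1, d2f0, d2f1, s):
    """Everett's formula (11-21) with k = 1: ordinates and second differences only."""
    t = 1.0 - s
    return (t * f0 + binom(t + 1.0, 3) * d2f0
            + s * f1 + binom(s + 1.0, 3) * d2f1)

# Hypothetical table: sin(x) tabulated with step h = 0.1 around x0 = 1.0.
h, x0 = 0.1, 1.0
f = {k: sin(x0 + k * h) for k in (-1, 0, 1, 2)}
d2f0 = f[1] - 2.0 * f[0] + f[-1]     # central second difference at x_0
d2f1 = f[2] - 2.0 * f[1] + f[0]      # central second difference at x_1

s = 0.3                               # interpolate at x = x0 + s*h = 1.03
approx = everett_cubic(f[0], f[1], d2f0, d2f1, s)
linear = (1.0 - s) * f[0] + s * f[1]  # linear interpolation, for comparison
print(approx, linear, sin(x0 + s * h))
```

With one column of second differences the sketch reproduces the cubic through $x_{-1}, x_0, x_1, x_2$, which is markedly closer to the true value than linear interpolation between $x_0$ and $x_1$.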


11.4 Throwback†


Throwback of higher differences into lower differences is an extremely 
simple but ingenious device, due to Comrie, which enhances the accuracy 
of interpolation formulas without increasing the required numerical 
work. The idea is quite general; we explain it in the simplest possible 
situation. 


† This section may be omitted at first reading.

Everett’s formula for the interpolating polynomial of degree 5 may be
written

$$p_5(s) = t f_0 + \binom{t+1}{3}\left\{\delta^2 f_0 + \frac{(t+2)(t-2)}{4 \cdot 5}\,\delta^4 f_0\right\}$$

plus a similar term with $t$ replaced by $s$ and the subscript 0 replaced by 1.
In the interval $0 \le t \le 1$ the factor

$$\frac{(t+2)(t-2)}{4 \cdot 5} = \frac{t^2 - 4}{20}$$

varies between the narrow limits $-(4/20)$ and $-(3/20)$. Thus if we define
modified second differences $\delta^2 f_k^*$ by the formula

$$\delta^2 f_k^* = \delta^2 f_k - \tfrac{7}{40}\,\delta^4 f_k,$$

then Everett’s formula with fourth differences can be approximated by the
formula

$$p_5^*(s) = t f_0 + \binom{t+1}{3}\delta^2 f_0^* + s f_1 + \binom{s+1}{3}\delta^2 f_1^*,$$

where $t = 1 - s$. This formula can be used in the same way as the
formula involving second differences only. The error committed in
replacing $p_5$ by $p_5^*$ is equal to

$$p_5^*(s) - p_5(s) = \frac{1}{40}\left[(1 - 2t^2)\binom{t+1}{3}\delta^4 f_0 + (1 - 2s^2)\binom{s+1}{3}\delta^4 f_1\right].$$

For $0 \le s \le 1$ (and consequently $0 \le t \le 1$) this turns out to be less
than

$$0.00122 \max\{|\delta^4 f_0|, |\delta^4 f_1|\}.$$

Thus if one unit in the least significant digit carried in the computation is
denoted by $u$, we have

$$|p_5^*(s) - p_5(s)| \le \tfrac{1}{2} u$$

already if the fourth differences are less than $400u$.

Many mathematical tables giving second differences print modified
instead of ordinary second differences. In forming the modified differences,
the factor $-(7/40)$ is frequently replaced by $-0.184$, a value
suggested by Comrie from a consideration of Bessel’s formula. Tables
with modified second differences make it possible to calculate (with an
error of less than one round-off error) the fifth degree interpolating polynomial
from only one difference column, with an amount of work
comparable to that for the third degree polynomial.


EXAMPLE 

12. According to the tables of Jahnke and Emde [1945], $x = 11.620$ is a
zero of the Bessel function $J_2(x)$. We check this statement by evaluating
$J_2(11.620)$, using the following values given in the British Association
tables:


    x        J_2(x)         modified second difference (units of 10^-8)
    11.4      0.05118808
    11.5      0.02793593
    11.6      0.00461559    15622
    11.7     -0.01854910    37729
    11.8     -0.04133747
    11.9     -0.06353402


With $x_0 = 11.6$, we have $s = 0.2$, $t = 0.8$,

$$\binom{s+1}{3} = -0.032, \qquad \binom{t+1}{3} = -0.048,$$

yielding $J_2(11.620) = -0.00003692$. The derivative of $J_2(x)$ satisfies

$$J_2'(x) = \tfrac{1}{2}[J_1(x) - J_3(x)]$$

and thus, again from tables, has near the suspected zero the approximate
value $-0.23$. From Newton’s formula, we thus expect

$$11.620 - \frac{-0.00003692}{-0.23} \approx 11.61984$$

to be a more accurate value of the desired zero and indeed, Watson [1944]
lists the desired zero as 11.6198412.
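
The arithmetic of example 12 can be retraced with a few lines of Python. The sketch below is only a check under stated assumptions: the ordinates and modified second differences are copied from the table above, the slope $-0.23$ is the approximate derivative quoted in the text, and the function names are hypothetical.

```python
def binom(z, m):
    """Generalized binomial coefficient C(z, m) for real z and integer m >= 0."""
    result = 1.0
    for i in range(m):
        result *= (z - i) / (i + 1)
    return result

def everett_throwback(f0, f1, m2f0, m2f1, s):
    """Everett's formula with modified second differences (throwback)."""
    t = 1.0 - s
    return (t * f0 + binom(t + 1.0, 3) * m2f0
            + s * f1 + binom(s + 1.0, 3) * m2f1)

# Table entries at x = 11.6 and x = 11.7; differences are in units of 1e-8.
f0, f1 = 0.00461559, -0.01854910
m2f0, m2f1 = 15622e-8, 37729e-8

value = everett_throwback(f0, f1, m2f0, m2f1, s=0.2)   # J_2(11.620)
slope = -0.23                      # approximate J_2'(x) near the zero (from the text)
zero = 11.620 - value / slope      # one Newton step
print(value, zero)
```

The Newton step moves the tabulated zero 11.620 to within a few units of the sixth decimal of the value 11.6198412 listed by Watson.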


Problems 


14. For the interpolation problem discussed in example 12, estimate
(a) the error $J_2(x) - p_5(s)$ due to pure interpolation;
(b) the error $p_5^*(s) - p_5(s)$ due to throwback.
(For (a), use the integral representation of problem 9, chapter 9, to
estimate the high derivative of $J_2$ required.)

15. Proceeding as in the derivation of Everett’s formula, obtain an inter- 
polation formula that uses only zeroth, third, sixth, ... differences. 

16. Devise a method for “throwing back the second differences into the
zeroth” in Everett’s formula, and estimate the error involved. Why is
the method less efficient than the one discussed in the text? When would
it be useful?

17. The zeros of the Bessel function $J_0(x)$ are known to approach the points

$$x = \left(n - \tfrac{1}{4}\right)\pi, \qquad n = 1, 2, 3, \ldots.$$

Verify this statement by evaluating $J_0\left(\left(n - \tfrac{1}{4}\right)\pi\right)$, $n = 1, 2, 3, \ldots$, using a
table of Bessel functions.

18. Check the values of the modified differences given in example 12 by 
forming ordinary second and fourth differences. 


Recommended Reading 


Finite difference techniques have been pushed to an especially high level 
in Great Britain. Books such as Fox [1957] as well as the volumes issued 
by the National Physical Laboratory [1961] and by the Nautical Almanac 
Service [1956] contain much excellent advice on interpolation by differ- 
ences. On the whole, however, the subject of interpolation of tables is 
somewhat in the eclipse due to the fact that most tables have been 
replaced by prestored programs in digital computers. 


chapter 12 numerical differentiation 


Above we have used the interpolating polynomial to approximate values 
of a function f at points where f is not known. Another use of the 
interpolating polynomial, of equal or even higher importance in practice, 
is the imitation of the fundamental operations of calculus. In all these 
applications the basic idea is extremely simple: Instead of performing the 
operation on the function f, which may be difficult or—in cases where f is 
known at discrete points only—impossible, the operation is performed on a
suitable interpolating polynomial. In the present chapter this program is 
carried out for the operation of differentiation. 


12.1. The Error of Numerical Differentiation 


Let $f$ be a function defined on an interval $I$ containing the set of points
$S = \{x_0, x_1, \ldots, x_n\}$ (not necessarily equidistant) and let $P_S$ be the
polynomial interpolating $f$ at the points of the set $S$. We seek to
approximate

$$f'(x) \quad\text{by}\quad P_S'(x), \qquad x \in I,$$

and wish to derive a formula for the error that must be expected in this
approximation.

It seems natural to obtain an expression for the error $f' - P_S'$ by
differentiating the error formula (9-5). If $f$ has a continuous derivative
of order $n + 1$ in $I$ and if $x \in I$, it was shown in §9.2 that

(12-1)    $f(x) - P_S(x) = L(x)g(x),$

where

(12-2)    $L(x) = (x - x_0)(x - x_1) \cdots (x - x_n),$

(12-3)    $g(x) = \dfrac{1}{(n+1)!} f^{(n+1)}(\xi_x),$

$\xi_x$ being some unspecified point in the smallest interval containing $x$ and
all the points $x_i$, $i = 0, 1, \ldots, n$. Corollary 9.2 shows that, although no
assertion can be made about the continuity or differentiability of $\xi_x$ as a
function of $x$, the function $g$ can be extended to a function that is continuous
on $I$. A similar consideration shows that the extended function
is even $n$ times continuously differentiable on $I$. Hence we obtain by
differentiating (12-1)


(12-4)    $f'(x) - P_S'(x) = L'(x)g(x) + L(x)g'(x).$

If $x$ is arbitrary, this expression is not of much use for the purpose of
estimating $f' - P_S'$, since we lack a convenient explicit representation for
$g'$ such as (12-3). However, if $x = x_j$ ($j = 0, 1, \ldots, n$) in (12-4), we
obtain by virtue of $L(x_j) = 0$

$$f'(x_j) - P_S'(x_j) = L'(x_j)g(x_j).$$

Recalling that

$$L'(x_j) = \lim_{x \to x_j} \frac{L(x)}{x - x_j} = \lim_{x \to x_j} \prod_{\substack{i=0 \\ i \ne j}}^{n} (x - x_i) = \prod_{\substack{i=0 \\ i \ne j}}^{n} (x_j - x_i),$$


we can state the above result as follows: 


Theorem 12.1 Let the function $f$ be continuous and $n + 1$ times
continuously differentiable on an interval $I$ containing the $n + 1$
distinct points $x_0, x_1, \ldots, x_n$, and let $P_S$ denote the polynomial of
degree $\le n$ interpolating $f$ at these points. Then for each $j$,
$j = 0, 1, \ldots, n$, the interval spanned by the largest and the smallest of the
points $x_i$ contains a point $\xi_j$ such that

(12-5)    $f'(x_j) - P_S'(x_j) = \dfrac{f^{(n+1)}(\xi_j)}{(n+1)!} \prod_{\substack{i=0 \\ i \ne j}}^{n} (x_j - x_i).$
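
Formula (12-5) is easy to test numerically. The following Python sketch differentiates the interpolating polynomial at a node by means of the derivatives of the Lagrangian coefficients and compares the observed error with the two extreme values that (12-5) allows. The function name `node_derivative`, the choice $f(x) = e^x$, and the four non-equidistant nodes are assumptions made for illustration only.

```python
from math import exp, factorial

def node_derivative(xs, fs, j):
    """Derivative at the node xs[j] of the polynomial interpolating (xs, fs)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        if i == j:
            # l_j'(x_j) = sum over m != j of 1 / (x_j - x_m)
            w = sum(1.0 / (xs[j] - xs[m]) for m in range(n) if m != j)
        else:
            # l_i'(x_j) = prod_{m != i, j} (x_j - x_m) / prod_{m != i} (x_i - x_m)
            num = 1.0
            den = 1.0
            for m in range(n):
                if m != i:
                    den *= xs[i] - xs[m]
                    if m != j:
                        num *= xs[j] - xs[m]
            w = num / den
        total += w * fs[i]
    return total

xs = [0.0, 0.1, 0.25, 0.4]          # hypothetical, not equidistant
fs = [exp(x) for x in xs]
j = 1
error = exp(xs[j]) - node_derivative(xs, fs, j)

# Right-hand side of (12-5): f^(n+1)(xi) / (n+1)! times the product of (x_j - x_i).
product = 1.0
for i, x in enumerate(xs):
    if i != j:
        product *= xs[j] - x
n = len(xs) - 1
lower = exp(xs[0]) / factorial(n + 1) * product    # xi at the left end
upper = exp(xs[-1]) / factorial(n + 1) * product   # xi at the right end
print(lower, error, upper)
```

Since every derivative of $e^x$ is increasing, the observed error must lie between the two printed bounds, as the theorem asserts.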


Problems 


1. For a sufficiently differentiable function $f$, $f'(0)$ is approximated by
differentiating the polynomial $P$ interpolating $f$ at the points $0, h, 2h, \ldots, nh$.
Give a formula for the error $f'(0) - P'(0)$.

2. Indicate a /ower bound for the error if the derivative f’(0) of f(x) = e7 is 
replaced by the derivative of the polynomial interpolating f at the points 
O,1,...,” 

3. Give a formula for the error of numerically differentiating $f$ at $x = 0$ by
differentiating the interpolating polynomial using the points $-h, 0, h$.
Obtain the same error formula by subtracting the Taylor expansion for
$f(-h)$ from that for $f(h)$, both terminated with the term involving $h^3$.


12.2 Numerical Differentiation Formulas for Equidistant Abscissas


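We now shall assume that the points $x_k$ are equidistant, $x_k = x_0 + kh$,
$k = 0, \pm 1, \pm 2, \ldots$. For given $m$ and $n > m$,

$$\left|\prod_{\substack{i=m \\ i \ne j}}^{n} (x_j - x_i)\right|$$

is smallest when $x_j$ lies halfway between the extreme points $x_m$ and $x_n$.
This suggests the use of Stirling’s formula (11-19)

$$p_{2k}(s) = f_0 + \binom{s}{1}\left[\mu\delta f_0 + \frac{s}{2}\,\delta^2 f_0\right] + \cdots + \binom{s+k-1}{2k-1}\left[\mu\delta^{2k-1} f_0 + \frac{s}{2k}\,\delta^{2k} f_0\right],$$

where $s = (x - x_0)/h$, for numerical differentiation. The derivative with
respect to $x$ at $x_0$ equals $h^{-1}$ times the derivative with respect to $s$ at $s = 0$.
The derivatives of the coefficients

$$\frac{s}{2k}\binom{s+k-1}{2k-1}, \qquad k = 1, 2, \ldots,$$

are zero, since they contain the factor $s^2$. For the remaining coefficients
we find

$$\frac{d}{ds}\binom{s+k-1}{2k-1}\bigg|_{s=0} = \lim_{s \to 0} \frac{1}{s}\binom{s+k-1}{2k-1}$$
$$= \lim_{s \to 0} \frac{(s+k-1)(s+k-2) \cdots (s+1)(s-1) \cdots (s-k+1)}{(2k-1)!}$$
$$= (-1)^{k-1}\frac{[(k-1)!]^2}{(2k-1)!}.$$

The derivative of the polynomial $P_{2k}(x) = p_{2k}((x - x_0)h^{-1})$ at the
central point $x_0$ can thus be written

(12-6)    $P_{2k}'(x_0) = \dfrac{1}{h}\left\{\mu\delta f_0 - \dfrac{(1!)^2}{3!}\,\mu\delta^3 f_0 + \cdots + (-1)^{k-1}\dfrac{[(k-1)!]^2}{(2k-1)!}\,\mu\delta^{2k-1} f_0\right\}$

or, evaluating the numerical coefficients,

$$P_{2k}'(x_0) = \frac{1}{h}\left\{\mu\delta f_0 - \tfrac{1}{6}\mu\delta^3 f_0 + \tfrac{1}{30}\mu\delta^5 f_0 - \tfrac{1}{140}\mu\delta^7 f_0 + \cdots\right\}.$$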

Formulas in terms of ordinates can be found by expressing the central
differences $\delta^{2k-1} f_{-1/2}$ and $\delta^{2k-1} f_{1/2}$ in the form $\sum c_n f_n$ and arranging the
coefficients $c_n$ in a table, as follows:


Table 12.2a

    n                  -1      0      1
    δf_{-1/2}          -1      1      0
    δf_{1/2}            0     -1      1
    μδf_0             -1/2     0     1/2

Table 12.2b

    n                  -2     -1      0      1      2
    δ³f_{-1/2}         -1      3     -3      1      0
    δ³f_{1/2}           0     -1      3     -3      1
    μδ³f_0            -1/2     1      0     -1     1/2


The average of the coefficients in each column gives us the corresponding
coefficient for $\mu\delta^{2k-1} f_0$. For the second and fourth degree polynomials
we find in this manner


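(12-7)    $P_2'(x_0) = \dfrac{1}{2h}(f_1 - f_{-1}),$

(12-8)    $P_4'(x_0) = \dfrac{1}{h}\left\{\tfrac{1}{2}(f_1 - f_{-1}) - \tfrac{1}{12}(f_2 - 2f_1 + 2f_{-1} - f_{-2})\right\}$

$\qquad\qquad = \dfrac{1}{12h}(-f_2 + 8f_1 - 8f_{-1} + f_{-2}).$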


An expression for the error $f'(x_0) - P_{2k}'(x_0)$ is easily determined from
(12-5). With $n = 2k$, and the interpolating points arranged equidistantly
and symmetrically about $x_0$, we find

$$\frac{1}{(2k+1)!} L'(x_0) = (-1)^k \frac{(k!)^2}{(2k+1)!} h^{2k}.$$

It thus follows that

(12-9)    $f'(x_0) - P_{2k}'(x_0) = (-1)^k \dfrac{(k!)^2}{(2k+1)!} h^{2k} f^{(2k+1)}(\xi),$

where $x_{-k} < \xi < x_k$. As in the calculation of approximate function
values by algorithm 11.2, it is thus true that the error committed in
differentiating Stirling’s polynomial in place of the function $f$ equals the
first omitted term in the difference formula, provided that $\mu\delta^{2k+1} f_0$ is
replaced by $h^{2k+1} f^{(2k+1)}(\xi)$.
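
The orders $h^2$ and $h^4$ predicted by (12-9) for the formulas (12-7) and (12-8) can be observed directly. The Python sketch below is a minimal experiment under assumptions of our own choosing: the test function $\sin x$, the point $x_0 = 1$, and the function names are not taken from the text.

```python
from math import cos, sin

def d_three_point(f, x0, h):
    """Formula (12-7): derivative of the second degree Stirling polynomial."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

def d_five_point(f, x0, h):
    """Formula (12-8): derivative of the fourth degree Stirling polynomial."""
    return (-f(x0 + 2.0 * h) + 8.0 * f(x0 + h)
            - 8.0 * f(x0 - h) + f(x0 - 2.0 * h)) / (12.0 * h)

x0 = 1.0
errors = {}
for h in (0.2, 0.1, 0.05):
    e3 = cos(x0) - d_three_point(sin, x0, h)   # f'(x0) - P_2'(x0)
    e5 = cos(x0) - d_five_point(sin, x0, h)    # f'(x0) - P_4'(x0)
    errors[h] = (e3, e5)
    print(f"h = {h:4.2f}   error (12-7) = {e3:.3e}   error (12-8) = {e5:.3e}")
```

Halving $h$ reduces the first error by a factor close to 4 and the second by a factor close to 16, and both errors stay below the bounds $h^2/6$ and $h^4/30$ that (12-9) gives for a function whose derivatives are bounded by 1.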


Problems 


4. By using the Newton forward formula, devise a formula for numerical 
differentiation at the beginning of a table and obtain an error formula 
similar to (12-9). 

5. Making use of Bessel’s formula, obtain a formula for numerical differentia- 
tion halfway between two entries in a table, and write the formulas 
resulting from the polynomials of degree one and three in terms of 
ordinates. 

6. By differentiating Stirling’s form of the interpolating polynomial twice,
show that

$$f''(x_0) \approx \frac{1}{h^2}\left\{\delta^2 f_0 - \tfrac{1}{12}\delta^4 f_0 + \cdots\right\}.$$

Obtain a general formula for the coefficients on the right.

7. Suppose that, due to rounding, the values $f_n$ are known only up to errors
$\varepsilon_n$, where $|\varepsilon_n| \le \varepsilon$. What is the maximum error resulting therefrom in
the formulas (12-7) and (12-8), if all arithmetic operations are performed
without rounding error?

8. Suppose we calculate $J_0'(x)$ by means of (12-7) from a table of $J_0(x)$ giving
six decimals. What is the smallest value of $h$ for which we can guarantee
that the maximum possible discretization error (due to replacing $J_0$ by $P$)
is not exceeded by the maximum possible error due to rounding?


12.3. Extrapolation to the Limit 


The formulas derived in §12.2 can also be obtained by an entirely
different method which recalls the fundamental principle of numerical
analysis applied already in the derivation of Aitken’s $\Delta^2$-method: Improve
the accuracy of an approximation using any (possibly incomplete)
knowledge of the asymptotic behavior of the error.

The simplest numerical differentiation procedure consists in

$$\text{replacing } f'(x_0) \text{ by } \frac{1}{2h}(f_1 - f_{-1})$$

(see formula (12-7) above). We introduce an abbreviation for the
expression on the right that emphasizes its dependence on the step $h$ by
defining the basic differentiation operator $D_h$ as follows:

(12-10)    $D_h f(x) = \dfrac{1}{2h}\left[f(x + h) - f(x - h)\right].$

The error formula (12-9) then states in the special case $k = 1$ that

(12-11)    $D_h f(x_0) - f'(x_0) = \tfrac{1}{6} h^2 f'''(\xi),$

where $x_0 - h < \xi < x_0 + h$.

In order to obtain a more accurate statement about the error of $D_h f$,
we use Taylor’s expansion with remainder term. If the function $f$ is
sufficiently differentiable, we have for $k = 0, 1, \ldots$

$$f(x + h) = f(x) + \frac{h}{1!} f'(x) + \cdots + \frac{h^k}{k!} f^{(k)}(x) + \frac{h^{k+1}}{(k+1)!} f^{(k+1)}(\xi),$$

where $\xi$ is a point between $x$ and $x + h$. Writing down this expression
for $k = 2m$ and $x = x_0$ and then subtracting from it the same expression
with $h$ replaced by $-h$, we obtain after dividing by $2h$

(12-12)    $D_h f(x_0) = f'(x_0) + a_1 h^2 + a_2 h^4 + \cdots + a_{m-1} h^{2m-2} + R_m(h),$

where

$$a_k = \frac{1}{(2k+1)!} f^{(2k+1)}(x_0), \qquad k = 1, 2, \ldots, m - 1,$$

$$R_m(h) = \frac{h^{2m}}{(2m+1)!} \cdot \frac{1}{2}\left[f^{(2m+1)}(\xi_1) + f^{(2m+1)}(\xi_2)\right].$$

Here $\xi_1$ is a point between $x_0$ and $x_0 + h$ and $\xi_2$ is between $x_0$ and $x_0 - h$.
If $f^{(2m+1)}$ is continuous, it assumes every value between

$$f^{(2m+1)}(\xi_1) \quad\text{and}\quad f^{(2m+1)}(\xi_2).$$

Thus in particular for some $\xi$ between $\xi_1$ and $\xi_2$ (and thus between $x_0 - h$
and $x_0 + h$)

$$f^{(2m+1)}(\xi) = \tfrac{1}{2}\left[f^{(2m+1)}(\xi_1) + f^{(2m+1)}(\xi_2)\right].$$

It follows that the remainder $R_m$ in (12-12) can be expressed in the simpler
form

(12-13)    $R_m(h) = \dfrac{h^{2m}}{(2m+1)!} f^{(2m+1)}(\xi).$

With this form of the remainder, (12-12) for $m = 1$ reduces to (12-11).
In numerical applications the values of $f'(x_0)$ and of the constants
$a_1, a_2, \ldots$ are of course unknown. But the mere fact that a formula of
type (12-12) holds can be used to improve the accuracy of a numerical
differentiation. Suppose the operator $D_h$ is applied with two different
values of $h$, say $h$ and $qh$, where $q \ne 0, 1$. (If $f$ is tabulated at equally
spaced intervals, $q = \tfrac{1}{2}$ is a natural choice.) We then have from (12-12)

$$D_h f(x_0) = f'(x_0) + a_1 h^2 + O(h^4),$$
$$D_{qh} f(x_0) = f'(x_0) + a_1 (qh)^2 + O(h^4),$$

since $O(qh) = O(h)$. Eliminating from this pair of equations the constant
$a_1$ and solving for $f'(x_0)$, we find

(12-14)    $f'(x_0) = \dfrac{D_{qh} f(x_0) - q^2 D_h f(x_0)}{1 - q^2} + O(h^4).$

Using two values of the basic differentiation operator $D_h$ (which ordinarily
has an error $O(h^2)$) we thus have succeeded in deriving a differentiation
formula with an error of only $O(h^4)$.

Without a more careful study of the error term we cannot assert that
(12-14) is more accurate than (12-10) for any given $h$. However, if $f^{(5)}$ is
continuous, then (12-14) is always more accurate for sufficiently small
values of $h$.

In the special case $q = 2$ (12-14) takes the form

$$f'(x_0) = \frac{1}{3}\left\{\frac{4}{2h}(f_1 - f_{-1}) - \frac{1}{4h}(f_2 - f_{-2})\right\} + O(h^4).$$

We thus obtain the approximate differentiation formula

$$f'(x_0) = \frac{-f_2 + 8f_1 - 8f_{-1} + f_{-2}}{12h} + O(h^4),$$

which turns out to be identical with (12-8).

Another way to look at formula (12-14), more in the spirit of Aitken’s
$\Delta^2$-process, is to consider it as a device to speed up the convergence of the
basic differentiation operator $D_h$ as $h \to 0$. The speeded up operator may
be written in the form

(12-15)    $D_{qh}^* = \dfrac{D_{qh} - q^2 D_h}{1 - q^2}.$


EXAMPLE 
1. To find the first derivative of $J_0(x)$ at $x = 2$.


Table 12.3

    h       D_h J_0(2)        D*_h J_0(2)       D**_h J_0(2)
    0.40    -0.56611 8105
    0.20    -0.57406 0360     -0.57670 7779
    0.10    -0.57605 7896     -0.57672 3741     -0.57672 4805
    0.05    -0.57655 8030     -0.57672 4742     -0.57672 4808


The exact value, to nine places, is $J_0'(2) = -0.57672\,4808$.

Nothing prevents us from trying to further speed up the sequence of the
values $D_h^* f(x_0)$. The appropriate formula can be obtained by considering
the structure of the error of $D_h^* f(x_0)$. From (12-12) we find

$$D_{qh} f(x_0) = f'(x_0) + a_1 q^2 h^2 + a_2 q^4 h^4 + \cdots + a_{m-1} q^{2m-2} h^{2m-2} + R_m(qh).$$

It follows that

$$D_{qh}^* f(x_0) = f'(x_0) + a_2^* (qh)^4 + \cdots + a_{m-1}^* (qh)^{2m-2} + R_m^*(qh),$$

where the constants $a_k^* = a_k (1 - q^{2-2k})/(1 - q^2)$ do not depend on $h$, and where

$$R_m^*(qh) = \frac{R_m(qh) - q^2 R_m(h)}{1 - q^2}.$$

Thus, in particular,

$$D_h^* f(x_0) = f'(x_0) + a_2^* h^4 + O(h^6),$$
$$D_{qh}^* f(x_0) = f'(x_0) + a_2^* q^4 h^4 + O(h^6).$$

Eliminating $a_2^*$ yields

(12-16)    $f'(x_0) = D_{qh}^{**} f(x_0) + O(h^6),$

where

(12-17)    $D_{qh}^{**} = \dfrac{D_{qh}^* - q^4 D_h^*}{1 - q^4}.$

Still using the same basic differentiation operator $D_h$, we thus have
obtained a differentiation formula with an error $O(h^6)$. Obviously the
procedure can be continued, the only limits being those set by the accuracy
of the values $f(x_0 \pm h)$ themselves. The effectiveness of the procedure is
illustrated by the last column of table 12.3, where the derivative has been
obtained to nine significant digits from only six values of the function.
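
The repeated extrapolation of example 1 can be retraced by the following Python sketch. It is an illustration under assumptions of our own: $J_0$ is evaluated from its power series (adequate for the moderate arguments used here), $q = \tfrac{1}{2}$, and the names `bessel_j0`, `central` and `extrapolation_table` are hypothetical. The sketch carries the elimination one column further than table 12.3 does.

```python
def bessel_j0(x, terms=40):
    """J_0(x) summed from its power series; a stand-in for a table lookup."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= -(x / 2.0) ** 2 / (k + 1) ** 2
    return total

def central(f, x0, h):
    """Basic differentiation operator D_h of (12-10)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

def extrapolation_table(f, x0, h0, rows, q=0.5):
    """Row m holds D, D*, D**, ... for the step h0 * q**m."""
    table = []
    for m in range(rows):
        row = [central(f, x0, h0 * q ** m)]
        for n in range(m):
            r = q ** (2 * (n + 1))        # q^2 for (12-15), q^4 for (12-17), ...
            row.append((row[n] - r * table[m - 1][n]) / (1.0 - r))
        table.append(row)
    return table

table = extrapolation_table(bessel_j0, 2.0, 0.40, rows=4)
for m, row in enumerate(table):
    print(f"{0.40 * 0.5 ** m:5.2f}", "  ".join(f"{v:.9f}" for v in row))
```

The first three columns agree with table 12.3 to within a unit of the ninth decimal.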


Problems 


9. Using Taylor’s expansion, show that

$$\frac{1}{h^2}\,\delta^2 f_0 = f''(x_0) + \frac{h^2}{12} f^{(4)}(\xi),$$

where $x_0 - h < \xi < x_0 + h$.

10. By eliminating $a_1$ and $a_2$ in the formulas

$$D_{kh} f(x_0) = f'(x_0) + a_1 (kh)^2 + a_2 (kh)^4 + O(h^6), \qquad k = 1, 2, 3,$$

obtain a differentiation formula that uses the abscissas $f_{-3}, f_{-2}, \ldots, f_3$
and has an error $O(h^6)$. Compare the result with the formula (12-6) for
$k = 3$.

11. Suppose that, due to rounding, the values $D_h f(x_0)$ are known only up to
rounding errors not exceeding $\varepsilon/h$. If the values $D_h^* f(x_0)$ and $D_h^{**} f(x_0)$
are formed with $q = \tfrac{1}{2}$, what are their errors due to the inaccuracies in
$D_h f(x_0)$?


12.4 Extrapolation to the Limit: The General Case 


Extrapolation to the limit as applied to numerical differentiation in
§12.3 is a mere special case of the following general situation: An unknown
quantity $a_0$ is approximated by a calculable quantity $A(y)$ ($y > 0$) such
that

(12-18)    $\lim_{y \to 0} A(y) = a_0,$

and it is known that there exist constants $a_1, a_2, \ldots$ and $C_1, C_2, \ldots$ such†
that for $k = 1, 2, 3, \ldots$

(12-19)    $A(y) = a_0 + a_1 y + a_2 y^2 + \cdots + a_{k-1} y^{k-1} + R_k(y),$

where

$$|R_k(y)| < C_k y^k, \qquad y > 0.$$

A triangular array of numbers $A_{m,n}$ ($m = 0, 1, 2, \ldots$; $n \le m$) is now
formed in the following manner.


Algorithm 12.4 For two fixed constants $r$ and $y_0$ ($0 < r < 1$, $y_0 > 0$),
let for $m = 0, 1, 2, \ldots$

$$A_{m,0} = A(r^m y_0),$$

$$A_{m,n+1} = \frac{A_{m,n} - r^{n+1} A_{m-1,n}}{1 - r^{n+1}}, \qquad n = 0, 1, \ldots, m - 1.$$


The manner in which the numbers $A_{m,n}$ depend on each other is indicated
in scheme 12.4.


    A_{0,0}
    A_{1,0}   A_{1,1}
    A_{2,0}   A_{2,1}   A_{2,2}
    A_{3,0}   A_{3,1}   A_{3,2}   A_{3,3}

Scheme 12.4


This scheme has the following interesting property.


† In technical language the hypothesis (12-19) means that $A(y)$ admits an asymptotic
expansion for $y \to 0+$. We do not assume that the infinite series
$a_0 + a_1 y + a_2 y^2 + \cdots$ converges.


Theorem 12.4 Let $A = A(y)$ satisfy the relations (12-19). Then for
each $n$ such that $a_{n+1} \ne 0$, the $(n + 1)$st column in the scheme 12.4
converges faster to $a_0$ than the $n$th column, in the sense that

(12-20)    $\lim_{m \to \infty} \dfrac{A_{m,n+1} - a_0}{A_{m,n} - a_0} = 0.$

More generally, for each fixed value of $n \ge 0$,

(12-21)    $A_{m,n} - a_0 = (-1)^n r^{-n(n+1)/2} a_{n+1} (r^m y_0)^{n+1} + O((r^m y_0)^{n+2})$

as $m \to \infty$.

Here $O(z)$ denotes a quantity that remains bounded for $z \to 0$ when
divided by $z$.


Proof. We shall establish the following proposition, which is stronger
than (12-21): For each fixed $n$, and for each $p > n$,

(12-22)    $A_{m,n} = a_0 + a_{n+1,n} (r^m y_0)^{n+1} + a_{n+2,n} (r^m y_0)^{n+2} + \cdots + a_{p,n} (r^m y_0)^p + O((r^m y_0)^{p+1})$

as $m \to \infty$, where for $q = n + 1, n + 2, \ldots$

(12-23)    $a_{q,n} = \dfrac{(1 - r^{1-q})(1 - r^{2-q}) \cdots (1 - r^{n-q})}{(1 - r)(1 - r^2) \cdots (1 - r^n)}\, a_q.$

(Empty products are to be interpreted as 1.)

The proof is by induction with respect to $n$. For $n = 0$, the formulas
(12-22) and (12-23) are a mere restatement of our hypothesis (12-19).
Assuming that they are valid for some $n \ge 0$, it follows from algorithm
12.4 that $A_{m,n+1}$ has a representation of the form (12-22) where the
coefficient of $(r^m y_0)^q$ is

$$\frac{1 - r^{n+1-q}}{1 - r^{n+1}}\, a_{q,n}.$$

This is zero for $q = n + 1$ and equals $a_{q,n+1}$ for $q > n + 1$, verifying
(12-22) with $n$ increased by one.
Setting $q = n + 1$ in (12-23), we obtain

$$a_{n+1,n} = \frac{(1 - r^{-n})(1 - r^{-n+1}) \cdots (1 - r^{-1})}{(1 - r)(1 - r^2) \cdots (1 - r^n)}\, a_{n+1}$$
$$= r^{-(1 + 2 + \cdots + n)}\, \frac{(r^n - 1)(r^{n-1} - 1) \cdots (r - 1)}{(1 - r)(1 - r^2) \cdots (1 - r^n)}\, a_{n+1}$$
$$= (-1)^n r^{-n(n+1)/2}\, a_{n+1},$$

in view of the well-known formula

$$1 + 2 + \cdots + n = \frac{n(n+1)}{2}.$$

Letting $p = n + 1$ in (12-22) and using the above value of $a_{n+1,n}$, we thus
obtain (12-21).
If $a_{n+1} \ne 0$, relation (12-21) may be written

$$A_{m,n} - a_0 = (-1)^n a_{n+1} r^{-n(n+1)/2} (r^m y_0)^{n+1} [1 + O(r^m y_0)].$$

Thus,

$$\frac{A_{m,n+1} - a_0}{A_{m,n} - a_0} = -\frac{a_{n+2}}{a_{n+1}}\, r^{-n-1} (r^m y_0) [1 + O(r^m y_0)],$$

which implies (12-20).
The technique described in §12.3, and in particular in example 1,
corresponds to the special case $A(h^2) = D_h$, $r = q^2$ of algorithm 12.4.
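
Algorithm 12.4 translates almost literally into a program. The Python sketch below is one possible rendering; the function name `extrapolate` and the test quantity $A(y) = (1 + y)^{1/y}$, whose limit for $y \to 0$ is $e$, are assumptions chosen for illustration and are not prescribed by the text.

```python
from math import e

def extrapolate(A, y0, r, rows):
    """Triangular array of algorithm 12.4: table[m][n] holds A_{m,n}."""
    table = []
    for m in range(rows):
        row = [A(r ** m * y0)]                       # A_{m,0} = A(r^m y_0)
        for n in range(m):
            rn = r ** (n + 1)
            # A_{m,n+1} = (A_{m,n} - r^{n+1} A_{m-1,n}) / (1 - r^{n+1})
            row.append((row[n] - rn * table[m - 1][n]) / (1.0 - rn))
        table.append(row)
    return table

def A(y):
    """Hypothetical test quantity with limit e as y tends to 0."""
    return (1.0 + y) ** (1.0 / y)

table = extrapolate(A, y0=1.0, r=0.5, rows=8)
print("first column :", table[-1][0])
print("diagonal     :", table[-1][-1])
print("limit e      :", e)
```

In this experiment the last entry of the first column is still wrong in the second decimal, while the diagonal entry built from the same eight values of $A$ agrees with $e$ to better than six decimals.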


Problems 


12. A numerical computation furnished the following approximations
$A(2^{-n})$ to a quantity $a_0$:


    n    A(2^{-n})
    0    0.000000
    1    0.250000
    2    0.316406
    3    0.343609
    4    0.356074
    5    0.362055
    6    0.364987
    7    0.366438


Determine $a_0$ as accurately as you can. (Exact value:
$a_0 = e^{-1} = 0.367879$.)
13. Show that the hypotheses of theorem 12.4 are satisfied for the function
$A(y) = (1 + y)^{1/y}$ ($y > 0$).

14. Show that the scheme generated by algorithm 12.4 is identical with the 
scheme that would be obtained by calculating A(0) by Neville interpolation 
(algorithm 10.5) using the interpolating points yo, ryo, r2yo,.... 

15. Suppose the values of the function $A(y)$ used in algorithm 12.4 are known
only up to errors $\le \varepsilon$. How large is the resulting uncertainty in the
columns $A_{m,n}$? Give numerical values for the “noise amplification
factor” for $n = 1, 2, 3$ and $r = 0.5$, $r = 0.235$.

16. The following relations hold for the symbol $O(z)$ introduced after theorem
12.4:

$$O(z) + O(z) = O(z);$$

$$O(cz) = O(z) \quad\text{for any constant } c.$$

From what theorems about limits do they follow?


12.5 Calculating Logarithms by Differentiation†


In this section we shall discuss a further application of algorithm 12.4
to a problem of numerical differentiation. Let $a > 1$ be a given number.
We consider the function

$$f(x) = a^x$$

and wish to evaluate $f'(0)$. By the rules of calculus, this of course equals

$$f'(0) = \log a.$$

By carrying out the differentiation numerically, we may thus hope to
obtain a method for calculating the natural logarithm of a given number.

We use the following one-sided approximation to the first derivative,
viz.:

$$S(h) = \frac{f(h) - f(0)}{h} = \frac{1}{h}(a^h - 1).$$

By the definition of the derivative,

$$\log a = f'(0) = \lim_{h \to 0} S(h).$$

This limit relation also holds if we restrict $h$ to the discrete set of values
$h = 2^{-n}$, $n = 0, 1, 2, \ldots$. Setting

(12-24)    $s_n = S(2^{-n}) = 2^n (a^{2^{-n}} - 1),$

we thus have

$$\log a = \lim_{n \to \infty} s_n.$$

The values $a^{2^{-n}}$ required to form the numbers $s_n$ can easily be generated
recursively by successively evaluating square roots, as follows:

$$a^{2^{-0}} = a, \qquad a^{2^{-n}} = \sqrt{a^{2^{-n+1}}}.$$

We thus have, in principle, solved our problem.

Readers who have studied §8.4 will immediately observe, however, that
the algorithm thus defined is numerically unstable. The quantity $a^{2^{-n}} - 1$
approaches zero as $n \to \infty$. If the numbers $a^{2^{-n}}$ are computed with a
fixed number of decimals, the relative error of $a^{2^{-n}} - 1$ will ultimately
become quite large. By rule (ii) of §8.4, the relative error of $s_n$ will likewise
be large, and, since the $s_n$ approach a nonzero limit, the absolute
error, too, will grow rapidly.


† This section may be omitted at first reading.

A procedure that is more stable numerically can be defined if we
generate the $s_n$ in a different way. Solving (12-24) for $a^{2^{-n}}$ we get

(12-25)    $a^{2^{-n}} = 1 + 2^{-n} s_n.$

On the other hand,

$$s_{n-1} = 2^{n-1}(a^{2^{-n+1}} - 1) = 2^{n-1}(a^{2^{-n}} + 1)(a^{2^{-n}} - 1).$$

Substituting for $a^{2^{-n}}$ the value (12-25), we have

$$s_{n-1} = 2^{n-1}(2 + 2^{-n} s_n)\, 2^{-n} s_n = s_n + 2^{-n-1} s_n^2.$$

Solving the quadratic for $s_n$ and observing that $s_n > 0$, we find

$$s_n = 2^n\left(\sqrt{1 + 2^{-n+1} s_{n-1}} - 1\right).$$

To avoid loss of accuracy, we write this in the form

(12-26)    $s_n = \dfrac{2 s_{n-1}}{1 + \sqrt{1 + 2^{-n+1} s_{n-1}}}.$

Together with the initial condition $s_0 = a - 1$ this relation may be used
to generate the sequence $\{s_n\}$ in a stable manner.‡
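
The difference between the two ways of generating $s_n$ shows up clearly in floating-point arithmetic. The following Python sketch compares them for $a = 6$ and $n = 50$; the function names are hypothetical, and the machine's double precision numbers play the role of the "fixed number of decimals" of the text.

```python
from math import log, sqrt

def s_naive(a, n):
    """s_n from (12-24), with a^(2^-n) obtained by n successive square roots."""
    root = a
    for _ in range(n):
        root = sqrt(root)
    return 2.0 ** n * (root - 1.0)     # cancellation in root - 1 for large n

def s_stable(a, n):
    """s_n from the recurrence (12-26), starting with s_0 = a - 1."""
    s = a - 1.0
    for k in range(1, n + 1):
        s = 2.0 * s / (1.0 + sqrt(1.0 + 2.0 ** (1 - k) * s))
    return s

a, n = 6.0, 50
print("naive  :", s_naive(a, n))
print("stable :", s_stable(a, n))
print("log a  :", log(a))
```

For $n = 50$ the quantity $a^{2^{-n}} - 1$ retains only a few significant bits, so the naive value can only be a multiple of $\tfrac{1}{4}$ and misses $\log 6$ already in the second decimal, whereas the recurrence (12-26) delivers $\log 6$ to nearly full working accuracy.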

For numerical purposes the convergence of the algorithm thus obtained,
even though stable, is intolerably slow (see problem 16). If it is to be put
to any practical use, we must be able to speed it up. Let us check whether
the function $S(h)$ satisfies the hypotheses of theorem 12.4. Since
$a^h = e^{h \log a}$, we have, using the exponential series,

$$S(h) = \log a + \frac{1}{2!}(\log a)^2 h + \frac{1}{3!}(\log a)^3 h^2 + \cdots.$$

Since this series converges for all values of $h$, condition (12-19) certainly
holds for $A(h) = S(h)$, and theorem 12.4 is applicable. We thus obtain


‡ Relation (12-26) also provides us with an example of those relatively rare non-linear
difference equations that have an explicit solution.


Algorithm 12.5 For $a > 1$, let $A_{0,0} = a - 1$, and calculate for
$m = 1, 2, \ldots$

$$A_{m,0} = \frac{2 A_{m-1,0}}{1 + \sqrt{1 + 2^{-m+1} A_{m-1,0}}},$$

$$A_{m,n+1} = \frac{2^{n+1} A_{m,n} - A_{m-1,n}}{2^{n+1} - 1},$$

where

(12-27)    $n = 0, 1, 2, \ldots, m - 1.$


Theorem 12.4 yields in the present special case


Theorem 12.5 All columns of the scheme generated in algorithm 12.5
converge to $\log a$, and each column converges faster than the preceding
one.


One can also show that the sequence $\{A_{m,m}\}$ of diagonal elements
converges to $\log a$, and converges faster than any column.


EXAMPLE 
2. For a = 6 algorithm 12.5 yields the following triangular array: 


Table 12.5 


5.000000 

2.898980 0.797959 

2.260338 1.621697 1.896277 

2.008267 1.756196 1.801029 1.787422 

1.895937 1.783606 1.792743 1.791559 1.791835 

1.842872 1.789806 1.791873 1.791748 1.791761 1.791759 
1.817076 1.791281 1.791773 1.791759 1.791759 1.791759 1.791759 


The table shows that log 6 = 1.791759 can be obtained to seven significant 
digits by the evaluation of six square roots. 
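
Table 12.5 can be regenerated with a short program. The Python sketch below follows algorithm 12.5 as stated above, with $r = \tfrac{1}{2}$; the function name `log_by_extrapolation` is hypothetical, and the computation is carried out in double precision rather than with the six decimals of the table.

```python
from math import log, sqrt

def log_by_extrapolation(a, rows):
    """Triangular array of algorithm 12.5; table[m][n] holds A_{m,n}."""
    table = [[a - 1.0]]                               # A_{0,0} = a - 1
    for m in range(1, rows):
        prev = table[m - 1][0]
        # first column: stable recurrence (12-26)
        row = [2.0 * prev / (1.0 + sqrt(1.0 + 2.0 ** (1 - m) * prev))]
        for n in range(m):
            p = 2.0 ** (n + 1)
            # extrapolation step of algorithm 12.4 with r = 1/2
            row.append((p * row[n] - table[m - 1][n]) / (p - 1.0))
        table.append(row)
    return table

table = log_by_extrapolation(6.0, rows=7)
for row in table:
    print("  ".join(f"{v:.6f}" for v in row))
print("log 6 =", log(6.0))
```

Only one square root is evaluated per row, so the seven rows cost six square roots, in agreement with the count given in the text.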

If higher accuracy is desired, it may be necessary to limit the number of
columns of the scheme in order to avoid build-up of rounding errors.
Condition (12-27) should then be replaced by

$$n = 0, 1, \ldots, \min(m, N) - 1,$$

where $N$ is the number of columns desired.


Problems 
17. Using Taylor’s formula, find a lower bound for the error $|S(h) - \log a|$.
Thus find a lower bound for the number $n$ necessary that

$$|s_n - \log a| < 0.5 \times 10^{-8},$$

if $a = 6$.

18. Use algorithm 12.5 to calculate log e to six decimals.

19. Let $1 \le a \le e$. By obtaining explicit values for the constants $C_k$ in
(12-19) from Taylor’s formula, find bounds for the errors of the quantities 
An.1 and Ano, considered as approximations to log a. 


Recommended Reading 


The principle of extrapolation to the limit was first clearly stated 
by Richardson [1927]. Repeated extrapolation to the limit is discussed 
by Bauer et al. [1963]. Concerning its application to numerical 
differentiation, see also Rutishauser [1963]. 


Research Problem 


Since every positive number can be represented in the form $e^n z$, where
$e^{-1/2} < z < e^{1/2} = 1.648\ldots$, it is only necessary to know $\log x$ in the
interval $(1, e^{1/2})$. Make a comparative study of the computation of $\log x$
in that interval (a) by algorithm 12.5, (b) by Taylor’s series.


chapter 13 numerical integration


We now turn to the problem of numerical evaluation of definite integrals. 
The method is the same as in chapter 12. Instead of performing the 
integration on the function f, which may be difficult, we perform the 
integration on a polynomial interpolating f at suitable points. We begin 
by giving a theoretical appraisal of the error committed in this 
approximation. 


13.1. The Error in Numerical Integration 


Our starting point is once again the general error formula proved in
§9.2. There it was shown that if $P_S$ interpolates $f$ on the set of points
$S = \{x_0, x_1, \ldots, x_n\}$, then

(13-1)    $f(x) - P_S(x) = L(x)g(x),$

where

$$L(x) = (x - x_0)(x - x_1) \cdots (x - x_n),$$

and where $g$ is a continuous function that can be expressed in the form

(13-2)    $g(x) = \dfrac{1}{(n+1)!} f^{(n+1)}(\xi_x),$

$\xi_x$ being a suitable number contained between the largest and the smallest
of the numbers $x_0, \ldots, x_n$ and $x$. If we integrate (13-1) between two
arbitrary limits $a$ and $b$, we evidently have

$$\int_a^b f(x)\,dx = \int_a^b P_S(x)\,dx + R_S^{(a,b)},$$

where the integration error $R_S^{(a,b)}$ can be expressed in the form

(13-3)    $R_S^{(a,b)} = \int_a^b L(x)g(x)\,dx.$

Let us now assume that the polynomial $L$ does not change its sign in the
interval $(a, b)$. This will evidently be the case if and only if no interpolating
point $x_i$ is contained in the interior of $(a, b)$. We then can apply
to the integral representation (13-3) the second mean value theorem of the
integral calculus (see Buck [1956], p. 58), with the result that

$$R_S^{(a,b)} = g(t) \int_a^b L(x)\,dx,$$

where $t$ is a suitable point in $(a, b)$. In view of the definition of $g$, we thus
have the following result:


Theorem 13.1 Let the function f and the polynomial P satisfy the
conditions of theorem 9.2. If the interval (a, b) contains no
interpolating point in its interior, then

(13-4)  $\int_a^b f(x)\,dx - \int_a^b P(x)\,dx = \frac{f^{(n+1)}(\xi)}{(n+1)!} \int_a^b L(x)\,dx,$

where $\xi$ is a number contained between the largest and the smallest
of the numbers $x_0, \ldots, x_n$, a, b.


It is clear that theorem 13.1 cannot be true if (a, b) contains an
interpolating point, since it then may happen that the integral on the right of
(13-4) vanishes.


EXAMPLE
1. $x_0 = -1$, $x_1 = 0$, $x_2 = 1$, $L(x) = x(x^2 - 1)$, $a = -b$,

$\int_a^b L(x)\,dx = 0.$


Problems 


1. The integral $\int_0^{\pi} \sin x\,dx$ is evaluated by interpolating the function f(x) =
sin x at the points x = 0 and x = $\pi$. Calculate the bound for the error
given by theorem 13.1 (using the fact that all derivatives are bounded by 1),
and compare it with the actual error.

2. Dropping the assumption that the interval (a, b) contains no interpolating
point, show that

$\left| \int_a^b f(x)\,dx - \int_a^b P(x)\,dx \right| \le \frac{M_{n+1}}{(n+1)!} \int_a^b |L(x)|\,dx,$

where $M_{n+1} = \max_{x \in I} |f^{(n+1)}(x)|$.
13.2 Numerical Integration using Backward Differences


Once again we specialize to the situation where the interpolating points
$x_k$ are equidistant, $x_k = x_0 + kh$, $k = \pm 1, \pm 2, \ldots$. We begin by deriving
an integration formula which appears to be lacking in symmetry,
but which for this very reason will be of use in the numerical integration of
differential equations.

Let us assume that the function f is interpolated at the points $x_0$,
$x_{-1}, \ldots, x_{-k}$, and that the integral of f is desired between the limits $x_0$ and
$x_1$. Setting as before

$s = \frac{x - x_0}{h},$

we use the Newton backward polynomial in the form given in problem 7,
chapter 11,

$P_k(x) = \sum_{m=0}^{k} (-1)^m \binom{-s}{m} \nabla^m f_0.$


Introducing s as variable of integration and observing that the differences 
are independent of s, we find 


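$\int_{x_0}^{x_1} P_k(x)\,dx = h \int_0^1 \Bigl\{ \sum_{m=0}^{k} (-1)^m \binom{-s}{m} \nabla^m f_0 \Bigr\}\,ds = h \sum_{m=0}^{k} c_m \nabla^m f_0,$

where

(13-5)  $c_m = (-1)^m \int_0^1 \binom{-s}{m}\,ds, \qquad m = 0, 1, 2, \ldots.$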


We also have, in the present case 


$L(x) = (x - x_0)(x - x_{-1}) \cdots (x - x_{-k}) = h^{k+1} s(s+1) \cdots (s+k) = (-1)^{k+1} h^{k+1} (k+1)! \binom{-s}{k+1}.$

Since L does not vanish in the interval $(x_0, x_1)$, and since

$\frac{1}{(k+1)!} \int_{x_0}^{x_1} L(x)\,dx = h^{k+2} c_{k+1},$

theorem 13.1 thus yields the formula

(13-6)  $\int_{x_0}^{x_1} f(x)\,dx = h\{c_0 f_0 + c_1 \nabla f_0 + c_2 \nabla^2 f_0 + \cdots + c_k \nabla^k f_0 + c_{k+1} h^{k+1} f^{(k+1)}(\xi)\},$

where $x_{-k} < \xi < x_1$, and where the constants $c_m$ are given by (13-5).
As on several previous occasions it is seen that the error due to
interpolation equals the first omitted term in the difference formula, provided
that the difference is replaced by the corresponding derivative, suitably
normalized (compare similar statements in §11.2 and §12.2).

From the definition (13-5) we find 


$c_0 = \int_0^1 ds = 1, \qquad c_1 = -\int_0^1 \binom{-s}{1}\,ds = \frac{1}{2}.$

For larger values of m, however, the $c_m$ are more easily calculated by a
method to be explained in §13.4.
In an entirely similar manner we can derive the formula 


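(13-7)  $\int_{x_{-1}}^{x_0} f(x)\,dx = h\{c_0^* f_0 + c_1^* \nabla f_0 + c_2^* \nabla^2 f_0 + \cdots + c_k^* \nabla^k f_0 + c_{k+1}^* h^{k+1} f^{(k+1)}(\xi)\},$

where $x_{-k} < \xi < x_0$, and

(13-8)  $c_m^* = (-1)^m \int_{-1}^{0} \binom{-s}{m}\,ds, \qquad m = 0, 1, 2, \ldots.$

In particular we find

$c_0^* = 1, \qquad c_1^* = -\frac{1}{2}.$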


Again a similar statement about the error applies. 


Problem 


3. Find a formula that integrates between $x_{-1/2}$ and $x_{1/2}$, using backward
differences at $x_0$. What can you say about the error?


13.3. Numerical Integration using Central Differences 


If values of f are available on both sides of the interval of integration, it 
seems preferable to perform the integration on a polynomial that takes into 
account all these values. Bessel’s formula (11-20) recommends itself for
the purpose of integrating between the limits $x_0$ and $x_1$. We again
introduce s as a variable of integration. If s is replaced by $1 - s$, the
binomial coefficients

$\binom{s+m-1}{2m}$

remain unchanged, but the factors $s - \frac{1}{2}$ change sign. It follows that all
integrals of coefficients of the form

$\bigl(s - \tfrac{1}{2}\bigr) \binom{s+m-1}{2m}$

vanish. If $P_{2k+1}$ denotes the polynomial interpolating f on the set
$x_{-k}, x_{-k+1}, \ldots, x_{k+1}$, we thus find

$\int_{x_0}^{x_1} P_{2k+1}(x)\,dx = h \int_0^1 \Bigl\{ \sum_{m=0}^{k} \binom{s+m-1}{2m} \mu\delta^{2m} f_{1/2} \Bigr\}\,ds = h \sum_{m=0}^{k} b_m\, \mu\delta^{2m} f_{1/2},$

where

(13-9)  $b_m = \int_0^1 \binom{s+m-1}{2m}\,ds, \qquad m = 0, 1, 2, \ldots.$

Since, in the present case,

$\frac{1}{(2k+2)!} L(x) = \frac{(x - x_{-k})(x - x_{-k+1}) \cdots (x - x_{k+1})}{(2k+2)!} = h^{2k+2} \binom{s+k}{2k+2},$

and consequently

$\frac{1}{(2k+2)!} \int_{x_0}^{x_1} L(x)\,dx = h^{2k+3} b_{k+1},$

theorem 13.1 yields the formula

(13-10)  $\int_{x_0}^{x_1} f(x)\,dx = h\{b_0\, \mu f_{1/2} + b_1\, \mu\delta^2 f_{1/2} + \cdots + b_k\, \mu\delta^{2k} f_{1/2} + b_{k+1} h^{2k+2} f^{(2k+2)}(\xi)\},$

where $x_{-k} < \xi < x_{k+1}$.
We easily find $b_0 = 1$, $b_1 = -\frac{1}{12}$. Thus the case k = 0 of (13-10)
yields

$\int_{x_0}^{x_1} f(x)\,dx = \frac{h}{2}(f_0 + f_1) - \frac{h^3}{12} f''(\xi),$

where $x_0 < \xi < x_1$. This is the familiar trapezoidal rule of numerical
integration. More about it in §13.6!


Problems 


4. By integrating Stirling’s formula (11-19) between the limits $x_{-1}$ and $x_1$,
obtain an approximate integration formula of the form

$\int_{x_{-1}}^{x_1} f(x)\,dx = h\{s_0 f_0 + s_1 \delta^2 f_0 + \cdots + s_k \delta^{2k} f_0\} + R_{k+1},$

and show that the remainder $R_{k+1}$ is of the order of $h^{2k+3}$. (Theorem 13.1
is not applicable in this case, why?) In particular, show that $s_0 = 2$,
$s_1 = \frac{1}{3}$ and hence obtain the approximate integration formulas

$\int_{x_{-1}}^{x_1} f(x)\,dx = 2h f_0 + O(h^3)$  (Midpoint rule);

$\int_{x_{-1}}^{x_1} f(x)\,dx = \frac{h}{3}(f_{-1} + 4 f_0 + f_1) + O(h^5)$  (Simpson’s rule).

5. Using Taylor’s expansion, show that

$\int_{x_{-1}}^{x_1} f(x)\,dx = 2h f_0 + \frac{1}{3} h^3 f''(\xi), \qquad x_{-1} < \xi < x_1.$


13.4 Generating Functions for Integration Coefficients†


We left unresolved the problem of finding numerical values of the
coefficients $c_m$, $c_m^*$, and $b_m$ introduced in the two preceding sections.
Such values are conveniently found by the method of generating functions.
The generating function of a sequence of coefficients $c_m$, for instance, is
the function C defined by the power series

$C(t) = \sum_{m=0}^{\infty} c_m t^m.$

If we succeed in determining a closed formula for C, we may hope to find
numerical values of the $c_m$ in a very simple manner.

We exemplify the method with the coefficients $c_m^*$ defined by (13-8).
Their generating function is

$C^*(t) = \sum_{m=0}^{\infty} c_m^* t^m = \sum_{m=0}^{\infty} (-t)^m \int_{-1}^{0} \binom{-s}{m}\,ds.$

It is easily seen that $|c_m^*| \le 1$, m = 0, 1, 2, .... Hence the power series
converges uniformly for $|t| \le \frac{1}{2}$, say. Interchanging summation and
integration we find

$C^*(t) = \int_{-1}^{0} \sum_{m=0}^{\infty} (-t)^m \binom{-s}{m}\,ds.$

By the general form of the binomial theorem (see Taylor [1959], p. 479),

$\sum_{m=0}^{\infty} (-t)^m \binom{-s}{m} = (1 - t)^{-s}.$

† This section may be omitted at first reading.

We thus find

$C^*(t) = \int_{-1}^{0} (1 - t)^{-s}\,ds.$

The integration can be carried out by observing that

$(1 - t)^{-s} = e^{-s \log(1 - t)}.$

We thus find

(13-11)  $C^*(t) = -\frac{1}{\log(1 - t)} \bigl[ e^{-s \log(1 - t)} \bigr]_{-1}^{0} = -\frac{t}{\log(1 - t)}.$

Observing

$-\frac{\log(1 - t)}{t} = 1 + \frac{1}{2} t + \frac{1}{3} t^2 + \cdots,$

we thus have the identity

$1 = C^*(t) \cdot \Bigl( -\frac{\log(1 - t)}{t} \Bigr) = (c_0^* + c_1^* t + c_2^* t^2 + \cdots)(1 + \tfrac{1}{2} t + \tfrac{1}{3} t^2 + \cdots).$

Comparing coefficients of like powers of t on both sides of this identity, we
find

$c_0^* = 1,$

$c_m^* + \frac{1}{2} c_{m-1}^* + \cdots + \frac{1}{m+1} c_0^* = 0, \qquad m = 1, 2, \ldots.$

These relations can readily be used to calculate the values of the coefficients
$c_m^*$ recursively.


Table 13.4a 





In a like manner, we obtain for the generating function

(13-12)  $C(t) = \sum_{m=0}^{\infty} c_m t^m$

of the coefficients $c_m$ defined by (13-5) the closed expression

(13-13)  $C(t) = -\frac{t}{(1 - t) \log(1 - t)},$

the recurrence relation $c_0 = 1$,

$c_m + \frac{1}{2} c_{m-1} + \frac{1}{3} c_{m-2} + \cdots + \frac{1}{m+1} c_0 = 1, \qquad m = 1, 2, \ldots,$

and from it the numerical values


Table 13.4b 
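
As an illustration, the two recurrence relations of this section can be evaluated in exact rational arithmetic. The following Python sketch is not part of the original text, and the function name `integration_coefficients` is chosen here purely for illustration. It generates the $c_m$ from the relation with right-hand side 1 and the $c_m^*$ from the relation with right-hand side 0.

```python
from fractions import Fraction

def integration_coefficients(count):
    """Return the lists [c_0, ..., c_{count-1}] and [c*_0, ..., c*_{count-1}]."""
    c, c_star = [], []
    for m in range(count):
        # c_m = 1 - (c_{m-1}/2 + c_{m-2}/3 + ... + c_0/(m+1))
        tail = sum(c[m - j] * Fraction(1, j + 1) for j in range(1, m + 1))
        c.append(Fraction(1) - tail)
        # c*_0 = 1;  c*_m = -(c*_{m-1}/2 + ... + c*_0/(m+1)) for m >= 1
        tail_star = sum(c_star[m - j] * Fraction(1, j + 1) for j in range(1, m + 1))
        c_star.append(Fraction(1 if m == 0 else 0) - tail_star)
    return c, c_star

c, c_star = integration_coefficients(5)
print("c_m :", ", ".join(str(x) for x in c))
print("c*_m:", ", ".join(str(x) for x in c_star))
```

With five terms the sketch yields $c_m$ = 1, 1/2, 5/12, 3/8, 251/720 and $c_m^*$ = 1, −1/2, −1/12, −1/24, −19/720.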





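In order to obtain a recurrence relation for the coefficients $b_m$ defined
by (13-9), we write their generating function in the form

$B(t) = \sum_{m=0}^{\infty} b_m (2t)^{2m} = \int_0^1 \sum_{m=0}^{\infty} \binom{s+m-1}{2m} (2t)^{2m}\,ds.$

To evaluate it, we require the formula

(13-14)  $\sum_{m=0}^{\infty} \binom{s+m-1}{2m} (2t)^{2m} = \frac{(T + t)^{2s-1} + (T - t)^{2s-1}}{2T},$

where

$T = \sqrt{1 + t^2}.$

This is a result from the theory of Legendre functions (see Erdelyi [1953],
equation 3.2 (14) in connection with 3.5 (12)) whose derivation cannot be
given here. Some manipulation yields

$B(t) = \sum_{m=0}^{\infty} b_m (2t)^{2m} = \frac{t}{T \log(T + t)}.$

Since

$\frac{d}{dt} \log(T + t) = \frac{1}{T} = 1 + \binom{-1/2}{1} t^2 + \binom{-1/2}{2} t^4 + \cdots,$

we easily find

$\frac{1}{t} \log(T + t) = 1 + \frac{1}{3} \binom{-1/2}{1} t^2 + \frac{1}{5} \binom{-1/2}{2} t^4 + \cdots.$

Hence we have

$\frac{\log(T + t)}{t} \sum_{m=0}^{\infty} b_m (2t)^{2m} = \frac{1}{T},$

or

$\Bigl\{ 1 + \frac{1}{3} \binom{-1/2}{1} t^2 + \frac{1}{5} \binom{-1/2}{2} t^4 + \cdots \Bigr\} \{ b_0 + 4 b_1 t^2 + 16 b_2 t^4 + \cdots \} = 1 + \binom{-1/2}{1} t^2 + \binom{-1/2}{2} t^4 + \cdots.$

Comparing coefficients, we get

(13-15)  $4^m b_m + \frac{4^{m-1}}{3} \binom{-1/2}{1} b_{m-1} + \cdots + \frac{1}{2m+1} \binom{-1/2}{m} b_0 = \binom{-1/2}{m}, \qquad m = 0, 1, 2, \ldots.$

The following values are obtained: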





Table 13.4c

m      0       1         2           3
b_m    1     −1/12     11/720    −191/60480
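
The recurrence (13-15) lends itself to mechanical evaluation. The sketch below, which is an illustration added here and not the author's program, solves (13-15) for $b_m$ in exact rational arithmetic; the helper names `binom_neg_half` and `central_integration_coefficients` are hypothetical.

```python
from fractions import Fraction

def binom_neg_half(m):
    """Generalized binomial coefficient C(-1/2, m) as an exact fraction."""
    value = Fraction(1)
    for j in range(m):
        value *= (Fraction(-1, 2) - j) / (j + 1)
    return value

def central_integration_coefficients(count):
    """Solve (13-15) successively for b_0, b_1, ..., b_{count-1}."""
    b = []
    for m in range(count):
        # Terms of (13-15) that involve the already known b_{m-1}, ..., b_0.
        known = sum(
            4 ** (m - j) * binom_neg_half(j) / (2 * j + 1) * b[m - j]
            for j in range(1, m + 1)
        )
        b.append((binom_neg_half(m) - known) / 4 ** m)
    return b

print(", ".join(str(x) for x in central_integration_coefficients(4)))
```

The first four values agree with the entries 1, −1/12, 11/720, −191/60480 of the table.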


Problems 


6. Find the generating function for the coefficients in the formula

$f'(x_0) = \frac{1}{h} [a_1 \Delta f_0 + a_2 \Delta^2 f_0 + \cdots]$

obtained by differentiating the Newton forward interpolating polynomial
at x = $x_0$.

7*. Obtain the generating function for the coefficients introduced in problem 4.

8. Using generating functions, show that

$c_m^* = c_m - c_{m-1}, \qquad m = 1, 2, 3, \ldots.$


13.5 Numerical Integration over Extended Intervals 


In §13.2 and §13.3 we have considered the problem of evaluating the
integral of a function f over an interval of length h in terms of differences
calculated with the step h. Here we shall study the equally important
problem of evaluating

$I = \int_a^b f(x)\,dx,$

where [a, b] is a fixed, not necessarily short, interval, as accurately as
possible. It is assumed that f is continuous on [a, b] and can be evaluated
at arbitrary points of that interval. Several procedures offer themselves;
we shall be able to dismiss two of them very briefly.

(i) Newton-Cotes formulas. The most natural idea that offers itself
seems to be to select a certain number of interpolating points within [a, b], to
interpolate f at these points, and to approximate the integral of f by the
integral of the interpolating polynomial. If the interpolating points
divide [a, b] into N equal parts, we arrive in this manner at certain
integration formulas which are called the Newton-Cotes formulas. Unfortunately
these formulas have, for large values of N, some very undesirable
properties. In particular, it turns out that there exist functions, even analytic
ones, for which the sequence of the integrals of the interpolating
polynomials does not converge towards the integral of the function f. Also,
the coefficients in these formulas are large and alternate in sign, which is
undesirable for the propagation of rounding error. For these reasons,
the Newton-Cotes formulas are rarely used for high values of N. For
N = 2, 3, 4 the formulas are identical with certain well-known integration
formulas which we shall discuss below from a different point of view.

(ii) Gaussian quadrature. One may try to avoid some of the short- 
comings of the Newton-Cotes formulas by relinquishing the equal spacing 
of the interpolating points. Gauss discovered that by a proper choice of 
the interpolating points one can construct integration formulas which, 
using N + 1 interpolating points, give the accurate value of the integral if
f is a polynomial of degree 2N + 1 or less. These formulas turn out to be
numerically stable, and they are in successful use at a number of computa- 
tion laboratories. The formulas suffer from the disadvantage, however, 
that the interpolating points as well as the corresponding weights are 
irregular numbers that have to be stored. This practically (although not 
theoretically) limits the applicability of these highly interesting formulas. 

In the following two sections, we shall discuss two integration schemes 
that are easy to use in practice. They are (iii) the Trapezoidal rule with 
end correction, and (iv) Romberg integration. 

Both can be regarded as more sophisticated forms of the trapezoidal 
rule discussed in §13.3. 


13.6 Trapezoidal Rule with End Correction 


As in the discussion of the Newton-Cotes formula, let us divide the
interval [a, b] into N equal parts of length

$h = \frac{b - a}{N}.$

We write

$x_n = a + nh, \qquad n = 0, 1, \ldots, N \quad (x_N = b)$

and evaluate the integral of f over each of the subintervals $[x_{n-1}, x_n]$
separately, using a different interpolating polynomial each time. Using
the symmetrical formula (13-10), we find for the integral over the nth
subinterval, if f has a continuous derivative of order 2k + 2,

(13-16)  $\int_{x_{n-1}}^{x_n} f(x)\,dx = h\{b_0\, \mu f_{n-1/2} + b_1\, \mu\delta^2 f_{n-1/2} + \cdots + b_k\, \mu\delta^{2k} f_{n-1/2} + b_{k+1} h^{2k+2} f^{(2k+2)}(\xi_n)\},$

where $x_{n-1-k} < \xi_n < x_{n+k}$. Adding up the integrals (13-16) for n =
1, 2, ..., N, we clearly obtain, in view of $x_0 = a$, $x_N = b$,


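(13-17)  $I = \int_a^b f(x)\,dx = h\Bigl\{ b_0 \sum_{n=1}^{N} \mu f_{n-1/2} + b_1 \sum_{n=1}^{N} \mu\delta^2 f_{n-1/2} + \cdots + b_k \sum_{n=1}^{N} \mu\delta^{2k} f_{n-1/2} + b_{k+1} h^{2k+2} \sum_{n=1}^{N} f^{(2k+2)}(\xi_n) \Bigr\},$

where $b_0 = 1$, $b_1 = -\frac{1}{12}$, ..., are the constants defined by (13-9) and
tabulated in §13.4. Let us consider the sums on the right separately. In
view of

$\mu f_{n-1/2} = \frac{1}{2}(f_{n-1} + f_n)$

the first term

$h \sum_{n=1}^{N} \mu f_{n-1/2} = h[\tfrac{1}{2}(f_0 + f_1) + \tfrac{1}{2}(f_1 + f_2) + \cdots + \tfrac{1}{2}(f_{N-1} + f_N)]$

clearly reduces to

(13-18)  $T_N = h[\tfrac{1}{2} f_0 + f_1 + f_2 + \cdots + f_{N-2} + f_{N-1} + \tfrac{1}{2} f_N].$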


We shall call this term the N-point trapezoidal value of the desired integral. 
It is the result of evaluating the integral by the trapezoidal rule familiar 
from elementary calculus (see Taylor [1959], p. 515). 
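
The trapezoidal value (13-18) is easily programmed. The following Python sketch is an illustration added here; the function name `trapezoidal` and the test integrand sin x on [0, π] (whose integral is 2) are assumptions made for the example. It also prints the ratio of successive errors when N is doubled, which should be close to 4 if the error behaves like $h^2$.

```python
import math

def trapezoidal(f, a, b, n):
    """Trapezoidal value T_N of (13-18) with N = n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

previous_error = None
for n in (4, 8, 16, 32):
    error = 2.0 - trapezoidal(math.sin, 0.0, math.pi, n)
    ratio = previous_error / error if previous_error is not None else float("nan")
    print(n, error, ratio)
    previous_error = error
```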

Even more drastic simplifications occur in the other sums on the right
of (13-17). Recalling that

$\delta^{2m} f_{n-1/2} = \delta^{2m-1} f_n - \delta^{2m-1} f_{n-1}$

and hence

$\mu\delta^{2m} f_{n-1/2} = \mu\delta^{2m-1} f_n - \mu\delta^{2m-1} f_{n-1},$

we find that the remaining sums in (13-17) “telescope” as follows:

$\sum_{n=1}^{N} \mu\delta^{2m} f_{n-1/2} = \sum_{n=1}^{N} (\mu\delta^{2m-1} f_n - \mu\delta^{2m-1} f_{n-1}) = \mu\delta^{2m-1} f_N - \mu\delta^{2m-1} f_0$

(m = 1, 2, ..., k). Thus, each of the sums multiplying $b_1, b_2, \ldots, b_k$
reduces to a difference of two central differences at the endpoints of the
interval of integration.

In order to simplify the last term, let us denote by M the maximum and
by m the minimum of the function $f^{(2k+2)}(x)$ for $x_{-k} \le x \le x_{N+k}$.
The sum

$\sum_{n=1}^{N} f^{(2k+2)}(\xi_n),$

having N = (b − a)/h terms, is then contained between the limits

$\frac{b - a}{h}\, m$  and  $\frac{b - a}{h}\, M.$

Since the continuous function $f^{(2k+2)}$ assumes all values between its
extreme values, there must be a $\xi$ in the above-mentioned interval such
that

$\frac{b - a}{h} f^{(2k+2)}(\xi) = \sum_{n=1}^{N} f^{(2k+2)}(\xi_n).$

For that value of $\xi$, we thus have

$h^{2k+2} b_{k+1} \sum_{n=1}^{N} f^{(2k+2)}(\xi_n) = (b - a) h^{2k+1} b_{k+1} f^{(2k+2)}(\xi).$


Gathering together the above results, we have 


Theorem 13.6 Let N and k be any two positive integers, and let f
have a continuous derivative of order 2k + 2 on the interval J =
[a − kh, b + kh]. Then

(13-19)  $\int_a^b f(x)\,dx = T_N + C_N^{(k)} + R_N^{(k)},$

where $T_N$ denotes the trapezoidal value of the integral defined by
(13-18), $C_N^{(k)}$ denotes the “end correction”

(13-20)  $C_N^{(k)} = h\{b_1(\mu\delta f_N - \mu\delta f_0) + b_2(\mu\delta^3 f_N - \mu\delta^3 f_0) + \cdots + b_k(\mu\delta^{2k-1} f_N - \mu\delta^{2k-1} f_0)\},$

and where, for some suitable $\xi \in J$,

(13-21)  $R_N^{(k)} = h^{2k+2} b_{k+1} (b - a) f^{(2k+2)}(\xi).$
In order to formulate the integration procedure implied in theorem 13.6
in algorithmic terms, one would have to devise a systematic method for
forming the sequence of end corrections $C_N^{(k)}$ (k = 1, 2, ...). This
would best be done by first forming, in the neighborhood of the points
$x_0$ and $x_N$, the two-step differences

$\mu\delta f_n = \frac{1}{2}(f_{n+1} - f_{n-1})$

and then using the identities

$\mu\delta^{2m-1} f_n = \delta^{2m-2}(\mu\delta f_n) = \delta^{2m-2}\bigl(\tfrac{1}{2}(f_{n+1} - f_{n-1})\bigr) \qquad (n = 0, N).$

The coefficients $b_m$ can be generated recursively from (13-15).


EXAMPLE
2. To evaluate $\int_0^{1.6} J_0(x)\,dx$. We choose the step h = 0.1, leading to
N = 16. From a table of Bessel functions we find

$T_{16}$ = 1.28934 6003.

Since $J_0(-x) = J_0(x)$, the contribution to the end correction at $x_0 = 0$
is zero. At $x_{16} = 1.6$ we find

$\mu\delta f_{16}$   = −0.05692 1406,   $h b_1\, \mu\delta f_{16}$   = 0.00047 4345,
$\mu\delta^3 f_{16}$ =  0.00040 8458,   $h b_2\, \mu\delta^3 f_{16}$ = 0.00000 0624,
$\mu\delta^5 f_{16}$ = −0.00000 3327,   $h b_3\, \mu\delta^5 f_{16}$ = 0.00000 0001,

yielding the more accurate value

$\int_0^{1.6} J_0(x)\,dx$ = 1.28982 0973,


correct to the number of places given. 
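
The computation of example 2 can be retraced with a short program. The Python sketch below is an illustration added here, under stated assumptions: instead of a table, $J_0$ is evaluated from its power series (adequate for the small arguments needed), the values $b_1, b_2, b_3$ are those produced by (13-15), and the function names `j0`, `mu_delta_odd` and `corrected_trapezoid` are hypothetical.

```python
import math

def j0(x, terms=25):
    """Bessel function J_0 from its power series (assumed adequate for |x| < 2)."""
    return sum((-1) ** k * (x / 2) ** (2 * k) / math.factorial(k) ** 2
               for k in range(terms))

def mu_delta_odd(f, x, h, m):
    """mu delta^(2m-1) f at x, formed as delta^(2m-2) of the two-step difference."""
    order = 2 * m - 2
    total = 0.0
    for j in range(order + 1):
        shift = (order // 2 - j) * h                      # argument of the inner difference
        two_step = 0.5 * (f(x + shift + h) - f(x + shift - h))
        total += (-1) ** j * math.comb(order, j) * two_step
    return total

B = [1.0, -1 / 12, 11 / 720, -191 / 60480]                # b_0, ..., b_3

def corrected_trapezoid(f, a, b, n, k):
    """Return the pair (T_N, C_N^(k)) of (13-18) and (13-20)."""
    h = (b - a) / n
    t_n = h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))
    correction = h * sum(
        B[m] * (mu_delta_odd(f, b, h, m) - mu_delta_odd(f, a, h, m))
        for m in range(1, k + 1)
    )
    return t_n, correction

t16, c16 = corrected_trapezoid(j0, 0.0, 1.6, 16, 3)
print(t16, c16, t16 + c16)
```

The differences at $x_0 = 0$ are computed as well; by the symmetry of $J_0$ they vanish up to rounding, as stated in the example.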


Problems 


9, Evaluate 


en 
sin x 
ic 
oO x 


with an error less than 107°. 
10. Devise a method, similar to the one discussed above, based on the formula 
obtained in problem 4. (Divide into an even number of subintervals.) 
11. Using the integration formula (13-6) and a similar formula involving 
forward differences, obtain an end correction for the trapezoidal rule that 
involves values $f_n$ satisfying $0 \le n \le N$ only.
12. Show that if the function f is periodic with period $\tau$, and if formula (13-19)
is used to evaluate the integral of f over a full period, then

$C_N^{(k)} = 0$ for all k and N.

13. Problem 12 shows that for the integration of a sufficiently differentiable
periodic function over a full period,

$\int_0^{\tau} f(x)\,dx - T_N = O(h^{2k+2})$

for every k. Does it follow that the trapezoidal value is exact?
14. Show that the trapezoidal rule $T_N$ yields for N > 1 the exact values of the
integrals

$\int_0^{2\pi} \cos x\,dx, \qquad \int_0^{2\pi} \sin x\,dx.$


(Use problem 19, chapter 6.) 


13.7 Romberg Integration

Theorem 13.6 states, in effect, that the trapezoidal approximation to
$\int_a^b f(x)\,dx$ calculated with the step h,

$A(h) = T_{M(h)},$

where $M(h) = (b - a) h^{-1}$, satisfies

$A(h) = a_0 - C_{M(h)}^{(k)} + O(h^{2k+2}),$

where $a_0$ is the exact value of the integral, and where $C_{M(h)}^{(k)}$ is defined by
(13-20). If we could show that for every positive integer k

(13-22)  $C_{M(h)}^{(k)} = a_1 h^2 + a_2 h^4 + \cdots + a_k h^{2k} + O(h^{2k+2}),$

then A(h) would satisfy the hypotheses for successful k-fold extrapolation
to the limit, that is, it would be of the form (12-19) where $y = h^2$. We
might expect then to speed up the convergence of the trapezoidal values
by an application of algorithm 12.4.

By expanding the differences appearing in (13-20) in powers of h, it is
easy to see that $C_{M(h)}^{(k)}$ indeed can be expanded in the form (13-22). For
instance,

$h\,\mu\delta f_0 = \frac{h}{2}(f_1 - f_{-1}) = h^2 f'(x_0) + \frac{h^4}{6} f'''(x_0) + \cdots,$

$h\,\mu\delta^3 f_0 = \frac{h}{2}(f_2 - 2 f_1 + 2 f_{-1} - f_{-2}) = h^4 f'''(x_0) + \cdots.$
We thus may apply algorithm 12.4, keeping in mind that $y = h^2$. The
choices of the values $y_0$ and r are dictated by the obvious procedure to
begin with a single subinterval (N = 1), and then to double the number
of subintervals at each step. Since $y = h^2$ this means that the ratio
between consecutive values of y is r = 4. The algorithm thus results in
the triangular array

$A_{0,0}$
$A_{1,0}$  $A_{1,1}$
$A_{2,0}$  $A_{2,1}$  $A_{2,2}$
$A_{3,0}$  $A_{3,1}$  $A_{3,2}$  $A_{3,3}$

defined by the recurrence relations

(13-23)  $A_{m,0} = T_{2^m}, \qquad m = 0, 1, 2, \ldots,$

(13-24)  $A_{m,n+1} = \frac{4^{n+1} A_{m,n} - A_{m-1,n}}{4^{n+1} - 1}, \qquad n = 0, 1, \ldots, m - 1.$


In order to define an economical algorithm, we note that of the ordinates
necessary to compute $T_{2^m}$ only those whose index is odd have to be freshly
calculated. The others are known from $T_{2^{m-1}}$. The computation is
most easily arranged by introducing besides the trapezoidal values $T_N$ the
midpoint values

$M_N = h(f_{1/2} + f_{3/2} + \cdots + f_{N-1/2}) = h \sum_{n=1}^{N} f\bigl(a + (n - \tfrac{1}{2})h\bigr) \qquad \Bigl(h = \frac{b - a}{N}\Bigr).$

Like the trapezoidal value, the midpoint value is an approximation to the
desired integral that is in error by $O(h^2)$. We clearly have

$T_{2N} = \frac{h}{2}[\tfrac{1}{2} f_0 + f_{1/2} + f_1 + f_{3/2} + \cdots + f_{N-1} + f_{N-1/2} + \tfrac{1}{2} f_N] = \frac{1}{2}(T_N + M_N).$

Thus in the present situation algorithm 12.4 can be formulated as follows: 





Algorithm 13.7 If the function f is defined on the interval [a, b],
generate the triangular array of numbers $A_{m,n}$ by means of the
recurrence relation (13-24) and

$A_{0,0} = T_1 = \frac{b - a}{2}[f(a) + f(b)],$

$M_{2^m} = h \sum_{n=1}^{2^m} f\bigl(a + (n - \tfrac{1}{2})h\bigr), \qquad h = 2^{-m}(b - a),$

$A_{m+1,0} = \frac{1}{2}(A_{m,0} + M_{2^m}), \qquad m = 0, 1, 2, \ldots.$

The hypotheses of theorem 12.4 are satisfied when f satisfies the
hypotheses of theorem 13.6 for every k when N is sufficiently large. This is the
case if f has derivatives of all orders on an interval containing the interval
[a, b] in its interior. This condition, in turn, is satisfied if f is analytic
on the closed interval [a, b] (see Buck [1956], p. 78).

Theorem 13.7 Let the function f be analytic on the closed interval
[a, b]. Then all columns of the array generated by algorithm 13.7
converge to $\int_a^b f(x)\,dx$. If none of the coefficients $a_k$ in (13-22)
vanishes, each column converges faster than the preceding one.


The scheme generated by algorithm 13.7 is commonly known as the 
Romberg scheme (see Bauer et al. [1963]). 


EXAMPLE 
3. For the problem considered in example 2 the Romberg scheme is as 
follows: 


Table 13.7 


1.16432 1734 

1.25919 0749 1.29081 3754 

1.28220 7763 1.28988 0101 1.28981 7857 

1.28792 0410 1.28982 4626 1.28982 0927 1.28982 0976 
1.28934 6003 1.28982 1201 1.28982 0973 1.28982 0973 


The accuracy of extrapolation to the limit with 16 steps is about the same
as applying $T_{16}$ with “end correction.” The advantage of the Romberg
procedure is that no decision has to be made in advance concerning the
best stepsize.
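
The Romberg scheme for this example can be reproduced by a few lines of code. The Python sketch below is an illustration added here and not the program of Bauer et al.; it assumes that $J_0$ may be evaluated from its power series, and the names `j0` and `romberg` are hypothetical. Each new trapezoidal value is obtained from the preceding one and a midpoint value, and the remaining entries of a row follow from (13-24).

```python
import math

def j0(x, terms=25):
    """Bessel function J_0 from its power series (assumed adequate for |x| < 2)."""
    return sum((-1) ** k * (x / 2) ** (2 * k) / math.factorial(k) ** 2
               for k in range(terms))

def romberg(f, a, b, levels):
    """Rows A[m][0..m], m = 0..levels, of the Romberg array."""
    rows = [[0.5 * (b - a) * (f(a) + f(b))]]              # A_{0,0} = T_1
    for m in range(1, levels + 1):
        n = 2 ** (m - 1)                                  # subintervals behind the previous row
        h = (b - a) / n
        midpoint = h * sum(f(a + (i - 0.5) * h) for i in range(1, n + 1))
        row = [0.5 * (rows[-1][0] + midpoint)]            # T_{2N} = (T_N + M_N) / 2
        for col in range(m):
            factor = 4 ** (col + 1)
            row.append((factor * row[col] - rows[-1][col]) / (factor - 1))
        rows.append(row)
    return rows

for row in romberg(j0, 0.0, 1.6, 4):
    print("  ".join(f"{value:.9f}" for value in row))
```

The printed rows may be compared entry by entry with table 13.7.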

Problem 13 shows that it may happen that some or even all coefficients 
in the expansion (13-22) are zero. In such cases extrapolation to the 
limit will not speed up the convergence of the trapezoidal rule. 


Problems 


15. Verify experimentally that extrapolation to the limit does not speed up 
the convergence of the trapezoidal rule for the integral 


ik Vx dx. 


pai 

sin x 
[PS ax 
fe} x 


with an error of less than 10~°, using repeated extrapolation to the limit. 


Can you explain why? 
16. Evaluate 
17. Show that the values $A_{m,1}$ obtained by algorithm 13.7 are identical with
the values obtained by applying Simpson’s rule (see problem 4) to $2^{m-1}$
subintervals.

18. Express the values $A_{m,2}$ in terms of the ordinates $f_n$. Can you recognize a
familiar integration rule?


Recommended Reading 


Hildebrand [1956] gives an excellent account of Gaussian quadrature 
in chapter 8. Many examples of the use of generating functions to obtain 
integration coefficients are given in Henrici [1962], chapter 5. Bauer et al. 
[1963] contains a definitive treatment of Romberg integration. Milne
[1949] gives numerous integration formulas not considered here and also 
has a general treatment of the error in numerical integration. 


Research Problems 


1. Make a comparative study of the effectiveness of the end correction 
versus repeated extrapolation to the limit. 

2. Study the connection of the end correction with the Euler-MacLaurin 
summation formula. 


chapter 14 numerical solution of differential equations


The mathematical formulation of many problems in science and engineer- 
ing leads to a relation between the values of an unknown function and the 
value of one or several of its derivatives at the same point. Such relations 
are called differential equations. The present chapter is devoted to the 
problem of finding numerical values of solutions of differential equations. 


14.1 Theoretical Preliminaries 


Let f = f(x, y) be a real-valued function of two real variables defined
for $a \le x \le b$, where a and b are finite, and for all real values of y. The
equation

(14-1)  $y' = f(x, y)$

is called an ordinary differential equation of the first order; it symbolizes the
following problem: To find a function y = y(x), continuous and differentiable
for $x \in [a, b]$, such that

$y'(x) = f(x, y(x))$

for all $x \in [a, b]$. A function y with this property is called a solution of
the differential equation (14-1).


EXAMPLE
1. The differential equation $y' = -y$ has the solutions $y(x) = Ce^{-x}$,
C = const.


As the above example shows, a given differential equation may have
many solutions. In order to pin down a solution, one must specify its
value at one point, say at x = a. The problem of finding a solution y of
(14-1) such that y(a) = s, where s is a given number, is called an initial
value problem and is schematically described by the equations

(14-2)  $y' = f(x, y), \qquad y(a) = s.$

EXAMPLE
2. The initial value problem $y' = -y$, y(0) = 1 has the solution
$y(x) = e^{-x}$.


In courses on differential equations much stress is placed on differential 
equations whose solutions can be expressed in terms of elementary 
functions, or of indefinite integrals of such functions. It is shown, for 
instance, that any differential equation of the form y’ = g(x)y + p(x) can 
be solved in this manner. The emphasis on explicitly solvable differential 
equations tends to create the impression that almost any differential 
equation can be solved in explicit form, if only the proper trick is found. 
Nothing could be farther from the truth, however. Most differential 
equations that occur in actual practice are much too complicated to admit 
an explicit solution; if numerical values of the solution are desired, they 
can only be found by special numerical methods. 


EXAMPLE 

3. The differential equation y’ = 1—y can be solved explicitly; for the 
equation y’ = 1—y + ey® sin x, which for small values of e differs only 
little from it, no such solution can be found. 


The fact that a differential equation does not possess an explicit solution 
does not mean that the solution does not exist, in the mathematical sense. 
The following existence theorem is proved in courses on differential 
equations (see Coddington [1961], p. 217). 


Theorem 14.1 Let the function f = f(x, y) be continuous for
$a \le x \le b$, $-\infty < y < \infty$, and let there exist a constant L such that,
for any two numbers y, z and all $x \in [a, b]$,

(14-3)  $|f(x, y) - f(x, z)| \le L\,|y - z|.$

Then, whatever the initial value s, the initial value problem (14-2)
possesses a unique solution y = y(x) for $x \in [a, b]$.


This theorem can be proved, for instance, by the method of successive 
approximations. Although this is, in a sense, a constructive method for 
proving the existence of a solution, the method is not suitable for the 
numerical computation of the solution, because it requires the evaluation 
of infinitely many indefinite integrals. The methods to be given below 
are far more economical and are therefore preferred in practice. 
Condition (14-3) resembles condition (4-4) imposed on functions
suitable for iteration, with the difference that the function f now depends 
on the additional parameter x. The condition is again called a Lipschitz 
condition. 


Problems 


1. Which of the following functions f satisfy a Lipschitz condition ? 





2 
(a) fv) = 4 
(b) fa =7 ear 08x81; 
(c) fy) = ¥*5 
(d) f(x,y) = VI. 


2. Show that the following condition is sufficient for f to satisfy (14-3): The
partial derivative $f_y$ exists and is continuous for $a \le x \le b$, $-\infty < y < \infty$,
and there exists a constant M such that

$|f_y(x, y)| \le M$

for all values of x and y in the indicated domain.


14.2 Numerical Integration by Taylor’s Expansion 


Throughout the balance of the present chapter we shall assume that the 
function f not only satisfies the conditions of the basic existence theorem 
14.1, but also that it possesses continuous derivatives with respect to both 
x and y of all orders required to justify the analytical operations to be 
performed. We shall refer to this property by the phrase “‘f is sufficiently 
differentiable.” 

It is a basic fact in the theory of ordinary differential equations that if f
is sufficiently differentiable, all derivatives of a solution of the differential
equation (14-1) are expressible in terms of the function f and its derivatives.
In fact, this statement is obviously true for the first derivative. Assuming
its truth for the mth derivative, we write

(14-4)  $y^{(m)} = f^{(m-1)}(x, y),$

where $f^{(m-1)}$, the (m − 1)st total derivative of f with respect to x, is a
certain combination of derivatives of f. Differentiating the identity

$y^{(m)}(x) = f^{(m-1)}(x, y(x))$

once more with respect to x, we have

$y^{(m+1)}(x) = f_x^{(m-1)}(x, y(x)) + f_y^{(m-1)}(x, y(x))\, y'(x)$

and thus, in view of $y'(x) = f(x, y(x))$,

$y^{(m+1)}(x) = f^{(m)}(x, y(x)),$

where

(14-5)  $f^{(m)}(x, y) = f_x^{(m-1)}(x, y) + f_y^{(m-1)}(x, y)\, f(x, y).$

We thus have


Algorithm 14.2 If f is sufficiently differentiable, the nth derivative of a
solution y of (14-1) is expressible in the form

$y^{(n)}(x) = f^{(n-1)}(x, y(x)), \qquad n = 1, 2, \ldots,$

where the functions $f^{(m)}$ can be calculated by means of the recurrence
relation

$f^{(0)} = f, \qquad f^{(m+1)} = f_x^{(m)} + f_y^{(m)} f, \qquad m = 0, 1, 2, \ldots.$

In place of $f^{(1)}$ and $f^{(2)}$ we shall also write $f'$ and $f''$.


EXAMPLES 


4. $f' = f_x + f_y f,$
$f'' = f_{xx} + f_{xy} f + f_y f_x + (f_{xy} + f_{yy} f + f_y^2) f$
$\quad = f_{xx} + 2 f f_{xy} + f^2 f_{yy} + (f_x + f f_y) f_y.$


5. Let $f(x, y) = x^2 + y^2$. We find

$f'(x, y) = 2x + 2y(x^2 + y^2),$
$f''(x, y) = 2 + 4xy + (2x^2 + 6y^2)(x^2 + y^2).$


In view of the fact that the derivatives of the solution of a differential
equation can be determined whenever the derivatives of f can be calculated
(which is, in principle, always the case if f is an elementary function of x
and y), one might think of approximating the solution y of the initial
value problem (14-2) by its Taylor series at x = a. In view of the initial
condition y(a) = s this series takes the form

(14-6)  $y(x) = s + \frac{x - a}{1!} f(a, s) + \frac{(x - a)^2}{2!} f'(a, s) + \frac{(x - a)^3}{3!} f''(a, s) + \cdots.$

However, the above examples suggest that unless the function f is very
simple, the higher total derivatives of f rapidly increase in complexity.
This circumstance makes it necessary to truncate the infinite series (14-6)
after very few terms. This necessarily restricts the range of values
of x over which the truncated series (14-6) can be expected to define a
good approximation of the solution y. One therefore will use the truncated
series with a small value of x − a, x − a = h say, and re-evaluate the
derivatives $f, f', f'', \ldots$ at the point x = a + h.


Problems 
3. Calculate the nth total derivative of f if 
(a) f(xy y)=y +e, (db) f(x, y) = e*y. 
4. If f(x, y) = (2y/x) — 1, show that f” is identically zero. What is the 
explanation of this fact? 
14.3. The Taylor Algorithm


In order to formalize the procedure outlined in the preceding section we
introduce the following notation. Let h > 0 be a constant, and set

$x_n = a + nh, \qquad n = 0, 1, 2, \ldots.$

We shall always denote by $y_n$ a number intended to approximate $y(x_n)$,
the value of the exact solution of the initial value problem (14-1) at
$x = x_n$. If the method of Taylor expansion is used, these values are
calculated according to the following scheme: Let, for some fixed integer
$p \ge 1$,

(14-7)  $T_p(x, y; h) = f(x, y) + \frac{h}{2!} f'(x, y) + \cdots + \frac{h^{p-1}}{p!} f^{(p-1)}(x, y).$

The numbers $y_n$ are then calculated successively as follows:


Algorithm 14.3 For a fixed value of h, generate the sequence of
numbers $\{y_n\}$ by the recurrence relation

(14-8a)  $y_0 = s,$
(14-8b)  $y_{n+1} = y_n + h\, T_p(x_n, y_n; h), \qquad n = 0, 1, 2, \ldots.$

This scheme is known as the Taylor algorithm of order p.


EXAMPLE 


6. The Taylor algorithm of order one is particularly simple. In view of
$T_1(x, y; h) = f(x, y)$, (14-8b) then becomes

(14-9)  $y_{n+1} = y_n + h f(x_n, y_n).$


This is also known as the Euler method, or as the Euler-Cauchy method. 
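
The Taylor algorithm translates directly into a short program once the total derivatives are supplied by hand. The following Python sketch is an illustration added here; the function name `taylor_algorithm` and the choice of test problem $y' = -y$, y(0) = 1 (example 2, for which $f = -y$, $f' = y$, $f'' = -y$) are assumptions made for the example. With a single function in the list it reduces to the Euler method (14-9).

```python
import math

def taylor_algorithm(derivs, a, s, h, steps):
    """Taylor algorithm of order p = len(derivs).

    derivs = [f, f', ..., f^(p-1)], each a function of (x, y).
    Returns the list [y_0, y_1, ..., y_steps].
    """
    def t_p(x, y):
        # T_p(x, y; h) = f + h/2! f' + ... + h^(p-1)/p! f^(p-1)
        return sum(h ** j / math.factorial(j + 1) * d(x, y)
                   for j, d in enumerate(derivs))

    values = [s]
    for n in range(steps):
        x_n = a + n * h
        values.append(values[-1] + h * t_p(x_n, values[-1]))
    return values

f0 = lambda x, y: -y      # f
f1 = lambda x, y: y       # f'  = f_x + f_y f
f2 = lambda x, y: -y      # f'' = f'_x + f'_y f

euler = taylor_algorithm([f0], 0.0, 1.0, 0.1, 10)
order3 = taylor_algorithm([f0, f1, f2], 0.0, 1.0, 0.1, 10)
print(euler[-1], order3[-1], math.exp(-1.0))
```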
Naturally we expect the values defined by (14-8) to approximate the
corresponding values $y(x_n)$ of the exact solution, particularly when h is
small. To see whether this is the case, let us examine the purely academic
example

(14-10)  $y' = y, \qquad y(0) = 1.$

The exact solution, of course, is $y(x) = e^x$. In view of

$f(x, y) = f'(x, y) = f''(x, y) = \cdots = y,$

the function $T_p$ takes the simple form

$T_p(x, y; h) = \Bigl( 1 + \frac{h}{2!} + \frac{h^2}{3!} + \cdots + \frac{h^{p-1}}{p!} \Bigr) y.$

The recurrence relation (14-8) thus becomes

$y_{n+1} = \Bigl( 1 + \frac{h}{1!} + \frac{h^2}{2!} + \cdots + \frac{h^p}{p!} \Bigr) y_n.$

Figure 14.3
This is a difference equation of order 1 (see chapter 3); the solution
satisfying the initial condition $y_0 = 1$ is given by

(14-11)  $y_n = \Bigl( 1 + \frac{h}{1!} + \frac{h^2}{2!} + \cdots + \frac{h^p}{p!} \Bigr)^n, \qquad n = 0, 1, 2, \ldots.$

We wish to examine how closely $y_n$ approximates $y(x_n) = e^{x_n}$ at a fixed
point x of the interval of integration, if the approximation $y_n$ is calculated
by successively smaller integration steps h. We thus must let $h \to 0$ and
$n \to \infty$ simultaneously in such a manner that $x = x_n = nh$ remains fixed.
By Taylor’s formula with remainder term (see Taylor [1959], p. 471)

$e^h = 1 + \frac{h}{1!} + \frac{h^2}{2!} + \cdots + \frac{h^p}{p!} + \frac{h^{p+1}}{(p+1)!} e^{\theta h},$

where $\theta$ is some unspecified number between 0 and 1. We now use the
fact that for any two real numbers A and B such that $A \ge B \ge 0$ and for
n = 1, 2, ...,

$A^n - B^n = (A - B)(A^{n-1} + A^{n-2} B + \cdots + A B^{n-2} + B^{n-1}) \le (A - B)\, n A^{n-1}.$

In particular, taking

$A = e^h, \qquad B = 1 + \frac{h}{1!} + \frac{h^2}{2!} + \cdots + \frac{h^p}{p!},$

we obtain

$y(x_n) - y_n = e^{nh} - \Bigl( 1 + \frac{h}{1!} + \cdots + \frac{h^p}{p!} \Bigr)^n \le \frac{h^{p+1}}{(p+1)!} e^{\theta h}\, n\, e^{(n-1)h}$

or, in view of $nh = x_n$, $0 < \theta < 1$,

(14-12)  $|y_n - y(x_n)| \le \frac{h^p}{(p+1)!}\, x_n e^{x_n}.$

This relation shows that, at a fixed point $x_n = x$, the error in the
approximation of the exact solution y by the Taylor algorithm of order p tends to
zero like $h^p$ as the stepsize h tends to zero.

The convergence when the integration steps are successively halved is
illustrated in figure 14.3.
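The behavior described by (14-11) and (14-12) can be checked numerically. The following sketch evaluates the closed form (14-11) for the example $y' = y$, $y(0) = 1$ at $x = 1$ with two step sizes; the helper names `taylor_step_factor` and `taylor_error` are illustrative assumptions, not part of the text.

```python
import math

def taylor_step_factor(h, p):
    """Sum_{j=0}^{p} h^j / j!, the factor raised to the n-th power in (14-11)."""
    return sum(h ** j / math.factorial(j) for j in range(p + 1))

def taylor_error(x, n, p):
    """Return y(x) - y_n for y' = y, y(0) = 1, using n steps of size h = x/n."""
    h = x / n
    return math.exp(x) - taylor_step_factor(h, p) ** n

for p in (1, 2, 3):
    e_h = taylor_error(1.0, 10, p)       # step h = 0.1
    e_half = taylor_error(1.0, 20, p)    # step h = 0.05
    bound = math.exp(1.0) * 0.1 ** p / math.factorial(p + 1)   # right side of (14-12)
    print(p, e_h, bound, e_h / e_half)
```

If the error behaves like $h^p$, halving the step should reduce it by a factor close to $2^p$, which is what the last printed column is meant to show.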

We shall now derive a similar bound for the error $y_n - y(x_n)$ of the
Taylor approximation $y_n$ to the solution y of an arbitrary initial value
problem (14-2). To this end we make the following assumptions:

(i) Not only the function f, but also the function $T_p$ defined by (14-7)
satisfies a Lipschitz condition with Lipschitz constant L:

(14-13)   $|T_p(x, y; h) - T_p(x, z; h)| \le L|y - z|$

for any y, z and all x and h such that $x \in [a, b]$, $x + h \in [a, b]$;

(ii) The (p + 1)st derivative of the exact solution y of (14-2) is
continuous and hence bounded on the closed interval [a, b], and

(14-14)   $|y^{(p+1)}(x)| \le Y_{p+1}$,   $x \in [a, b]$.

We then have:

Theorem 14.3. Under the above assumptions (i) and (ii), the error
$z_n = y_n - y(x_n)$ of the Taylor algorithm of order p is bounded as
follows:

(14-15)   $|z_n| \le h^p\, \frac{Y_{p+1}}{(p+1)!}\, \frac{e^{(x_n - a)L} - 1}{L}$.

The result shows that $z_n$ again tends to zero at least like $h^p$ as $x_n = x$ is
fixed and $h \to 0$.


Proof. By Taylor's formula with remainder, the exact solution satisfies

   $y(x_{n+1}) = y(x_n + h) = y(x_n) + h\,T_p(x_n, y(x_n); h) + \frac{h^{p+1}}{(p+1)!}\, y^{(p+1)}(\xi_n)$,

where $\xi_n$ is some point between $x_n$ and $x_{n+1}$. Subtracting the last relation
from (14-8b), we get

(14-16)   $z_{n+1} = z_n + h[T_p(x_n, y_n; h) - T_p(x_n, y(x_n); h)] - \frac{h^{p+1}}{(p+1)!}\, y^{(p+1)}(\xi_n)$

and hence

   $|z_{n+1}| \le |z_n| + h\,|T_p(x_n, y_n; h) - T_p(x_n, y(x_n); h)| + \frac{h^{p+1}}{(p+1)!}\, |y^{(p+1)}(\xi_n)|$.

Using (14-13) and (14-14), the expression on the right can be simplified,
with the following result:

(14-17)   $|z_{n+1}| \le |z_n| + hL|z_n| + h^{p+1}\, \frac{Y_{p+1}}{(p+1)!}$.

We now make use of the following auxiliary result:

Lemma 14.3 Let the elements of a sequence $\{w_n\}$ satisfy the
inequalities

(14-18)   $w_{n+1} \le (1 + a) w_n + B$,   $n = 0, 1, 2, \ldots$,

where a and B are certain positive constants, and let $w_0 = 0$.
Then

(14-19)   $w_n \le B\, \frac{e^{na} - 1}{a}$,   $n = 0, 1, 2, \ldots$.

Proof of lemma 14.3. The relation (14-19) is evidently true for n = 0.
Assuming its truth for some nonnegative integer n we have from (14-18),
in view of $1 + a < e^a$,

   $w_{n+1} \le (1 + a)\, B\, \frac{e^{na} - 1}{a} + B = B\, \frac{(1 + a) e^{na} - 1}{a} \le B\, \frac{e^{(n+1)a} - 1}{a}$,

establishing (14-19) with n increased by one. The truth of the lemma now
follows by the principle of induction.

Returning to the proof of theorem 14.3, we apply the lemma to the
relation (14-17), setting $w_n = |z_n|$, $a = hL$, $B = h^{p+1} Y_{p+1}/(p+1)!$.
By virtue of (14-8a) and (14-2), $|z_0| = 0$. Thus the conclusion (14-19)
applies; since the exponent $n \cdot hL$ equals $(x_n - a)L$, where a now denotes
the left end point of the interval of integration, this yields (14-15).
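As a numerical illustration of theorem 14.3, the sketch below compares the actual Euler error (p = 1) for $y' = -y$, $y(0) = 1$ with the bound (14-15). The constants are assumptions chosen for this particular problem: L = 1 is a Lipschitz constant of $f(x, y) = -y$, and $Y_2 = 1$ bounds $|y''(x)| = e^{-x}$ on $x \ge 0$. The function names are hypothetical.

```python
import math

def euler_errors(h, n_steps):
    """Errors z_n = y_n - exp(-x_n) of Euler's method for y' = -y, y(0) = 1."""
    y, errs = 1.0, [0.0]
    for n in range(1, n_steps + 1):
        y += h * (-y)
        errs.append(y - math.exp(-n * h))
    return errs

def bound_14_15(x, h, p=1, L=1.0, Y=1.0, a=0.0):
    """Right-hand side of (14-15) for the assumed constants L and Y = Y_{p+1}."""
    return h ** p * Y / math.factorial(p + 1) * (math.exp((x - a) * L) - 1.0) / L

h = 0.1
errs = euler_errors(h, 20)
for n in (1, 5, 10, 20):
    print(n * h, abs(errs[n]), bound_14_15(n * h, h))
```

The bound holds at every step but grows like $e^{x}$ while the true error eventually decays, which is consistent with the remark in §14.4 that the estimate is not a realistic indication of the size of the error.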


Problems 
5. Solve the initial value problem

   $y' = 1 - y$,   $y(0) = 0$,

by the Taylor algorithm of order p = 2, using the steps h = 0.5, h = 0.25,
and h = 0.125, and compare the values of the numerical solution at
$x_n = 1$ with the values of the exact solution.

6. Find an analytical expression for the values $y_n$ defined by (14-9) for the
initial value problem





fe ey = 
aa rR y(0) = 1, 


and verify the statement of theorem 14.3. [Hint: Use theorem 3.3 to 
solve the difference equation involved.] 


14.4 Extrapolation to the Limit 


The estimate for the error $z_n = y_n - y(x_n)$ given in theorem 14.3 should
not be regarded as a realistic indication of the actual size of the error. It
merely serves to prove the convergence of the Taylor algorithm and to
indicate the order of magnitude of the error. The estimate states, in
effect, that there exists a constant K so that

(14-20)   $|z_n| \le K h^p$

for all $x_n \in [a, b]$. This relation shows that the error tends to zero as
$h \to 0$. Can we make a statement about the manner in which the error
tends to zero? As we have observed repeatedly, such knowledge could be
helpful for the purpose of speeding up the convergence of the method as
$h \to 0$.

We shall prove: 


Theorem 14.4 If f is sufficiently differentiable, the errors $z_n = y_n - y(x_n)$
of the values defined by (14-8) satisfy

(14-21)   $z_n = h^p z(x_n) + O(h^{p+1})$,

where z denotes the solution of the initial value problem

(14-22)   $z' = f_y(x, y(x))\, z - \frac{1}{(p+1)!}\, y^{(p+1)}(x)$,   $z(a) = 0$.

Proof. We start from relation (14-16) above. By Taylor's theorem,

   $T_p(x_n, y_n; h) - T_p(x_n, y(x_n); h) = \frac{\partial T_p}{\partial y}(x_n, y(x_n); h)\, z_n + O(z_n^2)$
      $= \frac{\partial T_p}{\partial y}(x_n, y(x_n); 0)\, z_n + O(z_n^2) + O(h z_n)$
      $= f_y(x_n, y(x_n))\, z_n + O(h^{p+1})$,

since $T_p(x, y; 0) = f(x, y)$ and, by (14-20), $z_n = O(h^p)$. Also,
$y^{(p+1)}(\xi_n) = y^{(p+1)}(x_n) + O(h)$. We thus have

   $z_{n+1} = z_n + h f_y(x_n, y(x_n))\, z_n - \frac{h^{p+1}}{(p+1)!}\, y^{(p+1)}(x_n) + O(h^{p+2})$.

From this we subtract $h^p$ times the relation

   $z(x_{n+1}) = z(x_n) + h z'(x_n) + O(h^2)$
      $= z(x_n) + h\left[f_y(x_n, y(x_n))\, z(x_n) - \frac{1}{(p+1)!}\, y^{(p+1)}(x_n)\right] + O(h^2)$,

which follows from the definition of z. Setting

   $w_n = z_n - h^p z(x_n)$   $(n = 0, 1, 2, \ldots)$

we obtain

   $|w_{n+1}| \le |w_n| + hL|w_n| + O(h^{p+2})$.

Since $w_0 = 0$, lemma 14.3 now yields the relation $w_n = O(h^{p+1})$, equivalent
to (14-21).

In order to make explicit the dependence of the numerical values $y_n$ on
the step h with which they are calculated, we shall denote (in this section
only) by $y(x, h)$ the approximation to $y(x)$ calculated with the step h.
Relation (14-21) can then be written more explicitly as follows:

(14-23)   $y(x, h) = y(x) + h^p z(x) + O(h^{p+1})$.

In general, both quantities $y(x)$ and $z(x)$ are unknown in this relation,
the latter because it depends on the solution of a differential equation
involving the unknown function y. However, the mere fact that a
relation of the form (14-21) holds can be made the basis for an
extrapolation to the limit procedure. Eliminating $z(x)$ between (14-23) and
the similar relation

   $y(x, r^{-1}h) = y(x) + r^{-p} h^p z(x) + O(h^{p+1})$

for the numerical approximation calculated with the step $r^{-1}h$, where
$r \ne 1$, we find that the quantity

(14-24)   $y^*(x, h) = \frac{r^p\, y(x, r^{-1}h) - y(x, h)}{r^p - 1}$

differs from $y(x)$ by only $O(h^{p+1})$. As $h \to 0$, the values $y^*(x, h)$ will thus
converge to $y(x)$ faster than the values $y(x, h)$. Here r is, in principle,
arbitrary; for practical reasons one usually chooses r = 2.
Under the assumption that (14-23) holds in the generalized form

(14-25)   $y(x, h) = y(x) + a_p(x) h^p + a_{p+1}(x) h^{p+1} + \cdots + a_k(x) h^k + O(h^{k+1})$

for every k > p, we can speed up the values $y^*(x, h)$ once more, using
(14-24) with p replaced by p + 1. Continuing the process systematically
in the manner of algorithm 12.4, we obtain

Algorithm 14.4 For a fixed x in [a, b], form the triangular array of
numbers $A_{m,n}$ as follows:

   $A_{m,0} = y\left(x, \frac{x - a}{2^m}\right)$,   $m = 0, 1, 2, \ldots$;

   $A_{m,n+1} = \frac{2^{p+n} A_{m,n} - A_{m-1,n}}{2^{p+n} - 1}$,   $n = 0, 1, \ldots, m - 1$.

The fact that (14-25) holds for sufficiently differentiable functions f was
established by Gragg [1963]. If this is taken for granted, we can deduce
as in theorem 12.4 that each succeeding column of the array $A_{m,n}$ converges
faster to $y(x)$.

EXAMPLE

7. We integrate the initial value problem

   $y' = x - y^2$,   $y(0) = 0$

by Euler's method, using the steps h = 0.4, 0.2, 0.1, 0.05. The table
below shows the resulting values at x = 0.8, and the values obtained by
repeated extrapolation to the limit.

Table 14.4

   h       $A_{m,0}$    $A_{m,1}$    $A_{m,2}$    $A_{m,3}$    $A_{m,4}$

   0.8     0.00000
   0.4     0.16000    0.32000
   0.2     0.23682    0.31363    0.31151
   0.1     0.27201    0.30720    0.30505    0.30413
   0.05    0.28871    0.30541    0.30481    0.30478    0.30482
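The construction of the array in algorithm 14.4 can be sketched as follows for Euler's method (p = 1) applied to $y' = x - y^2$, $y(0) = 0$ at x = 0.8, the setting of Table 14.4. The function names and the choice of storing the array as a list of rows are illustrative assumptions.

```python
def euler_value(f, a, s, x_end, n_steps):
    """Euler approximation to y(x_end) using n_steps equal steps from x = a."""
    h = (x_end - a) / n_steps
    y = s
    for i in range(n_steps):
        y += h * f(a + i * h, y)
    return y

def extrapolation_table(f, a, s, x_end, rows, p=1):
    """Rows A[m][0..m] of the triangular array, with step (x_end - a) / 2**m in row m."""
    A = []
    for m in range(rows):
        row = [euler_value(f, a, s, x_end, 2 ** m)]
        for n in range(m):
            q = 2.0 ** (p + n)
            row.append((q * row[n] - A[m - 1][n]) / (q - 1.0))
        A.append(row)
    return A

A = extrapolation_table(lambda x, y: x - y * y, 0.0, 0.0, 0.8, 5)
for row in A:
    print("  ".join(f"{v:.5f}" for v in row))
```

Each new column removes one more power of h from the error expansion (14-25), so the entries along a row are expected to settle more quickly than the first column does.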
Problems 


7. Let the initial value problem

   $y' = -y$,   $y(0) = 1$


be solved by the Taylor algorithm of order 1. Where does the absolute 
value of the error (approximately) attain its maximum value? Verify 
your result numerically by performing a numerical integration using the 
step h = 0.1. 
8. Improve the values obtained in problem 5 by extrapolation to the limit. 
9. Devise a method for extrapolation to the limit if p, the order of the 
method, is not known. 


14.5 Methods of Runge-Kutta Type 


The practical application of the Taylor algorithm described in the
preceding sections is frequently tedious because of the necessity of
evaluating the functions $f', f'', \ldots$ at each integration step. It therefore is a
remarkable fact that formulas exist that produce values $y_n$ of the same
accuracy as some of the Taylor algorithms of order $p \ge 2$, but without
requiring the evaluation of any derivatives. We mention only a few of
the many available formulas of this type.

(i) The simplified Runge-Kutta method. Here we replace the function
$T_2$ in algorithm 14.3 by $K_2$, where

(14-26)   $K_2(x, y; h) = \tfrac{1}{2}[f(x, y) + f(x + h, y + h f(x, y))]$.

Clearly, no evaluation of $f'$ is necessary. Instead, the function f is
evaluated twice at each step. It can be shown that

(14-27)   $K_2(x, y; h) - T_2(x, y; h) = O(h^2)$,

and that the accumulated error satisfies an inequality of the form (14-20),
and, with a suitable definition of z, a relation of the form of (14-21), where
p = 2.

(ii) The classical Runge-Kutta method. Here the function $T_4$ in
algorithm 14.3 is replaced by $K_4$, a sum of four values of the function f,
defined as follows:

(14-28)   $K_4(x, y; h) = \tfrac{1}{6}[k_1 + 2k_2 + 2k_3 + k_4]$,

where

   $k_1 = f(x, y)$,
   $k_2 = f\left(x + \frac{h}{2},\ y + \frac{h}{2} k_1\right)$,
   $k_3 = f\left(x + \frac{h}{2},\ y + \frac{h}{2} k_2\right)$,
   $k_4 = f(x + h,\ y + h k_3)$.

It can be shown that

   $K_4(x, y; h) = T_4(x, y; h) + O(h^4)$

and that, as a consequence, the relations (14-20) and (14-21) now hold with
p = 4. (The proofs of these statements are extremely complicated.) In
view of this fact, the classical Runge-Kutta method is one of the most
widely used methods for the numerical integration of differential equations.

(iii) A mixed Runge-Kutta-Taylor method. If the first total derivative
of f is easily evaluated, we may use in place of $T_3$ in algorithm 14.3 the
function $G_3(x, y; h)$ defined by

(14-29)   $G_3(x, y; h) = f(x, y) + \frac{h}{2}\, f'\left(x + \frac{h}{3},\ y + \frac{h}{3} f(x, y)\right)$.

It can be shown that $G_3(x, y; h) - T_3(x, y; h) = O(h^3)$ and that, as a
consequence, (14-20) and (14-21) are true for p = 3.
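The increment functions $K_2$ of (14-26) and $K_4$ of (14-28) translate directly into code. The following is a minimal sketch; the names `heun_increment`, `rk4_increment`, and `integrate` are illustrative and do not appear in the text.

```python
import math

def heun_increment(f, x, y, h):
    """K_2 of (14-26): mean of the slope at (x, y) and at the Euler-predicted point."""
    k1 = f(x, y)
    return 0.5 * (k1 + f(x + h, y + h * k1))

def rk4_increment(f, x, y, h):
    """K_4 of (14-28): weighted mean of four slope evaluations."""
    k1 = f(x, y)
    k2 = f(x + h / 2, y + h / 2 * k1)
    k3 = f(x + h / 2, y + h / 2 * k2)
    k4 = f(x + h, y + h * k3)
    return (k1 + 2 * k2 + 2 * k3 + k4) / 6

def integrate(increment, f, a, s, h, n_steps):
    """Apply y_{n+1} = y_n + h * increment(f, x_n, y_n, h) for n_steps steps."""
    y = s
    for n in range(n_steps):
        y += h * increment(f, a + n * h, y, h)
    return y

f = lambda x, y: y                      # y' = y, exact solution e^x
print(integrate(heun_increment, f, 0.0, 1.0, 0.1, 10) - math.e)
print(integrate(rk4_increment, f, 0.0, 1.0, 0.1, 10) - math.e)
```

For the linear equation $y' = Ay$ a single step of each method reproduces the corresponding truncated exponential series, which is the content of problem 12 below.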
Problems 
10. Solve the initial value problem

   $y' = x - y^2$,   $y(0) = 0$

for $0 \le x \le 0.8$ by the classical Runge-Kutta method, using the step
h = 0.1. Also, perform the same integration with the steps h = 0.2 and
h = 0.4, and perform an extrapolation to the limit.

11. Establish the relation (14-27).

12. Show that for the differential equation $y' = Ay$ (A = const.) the values
$y_n$ produced by the methods $K_2$ and $K_4$ are identical, respectively, with the
values produced by $T_2$ and $T_4$.

13. Prove that $G_3$ differs from $T_3$ by $O(h^3)$.


14.6 Methods Based on Numerical Integration: The Adams-Bashforth Method 


All the methods discussed so far are based directly or indirectly on the 
idea of expanding the exact solution in a Taylor series. A different 
approach to the problem can be based on the idea of numerical integration. 
A solution y of (14-1) by definition satisfies the identity

   $y'(x) = f(x, y(x))$.

Integrating between the limits $x_n$ and $x_{n+1}$, we obtain

(14-30)   $y(x_{n+1}) = y(x_n) + \int_{x_n}^{x_{n+1}} f(x, y(x))\, dx$.

Let us now suppose that we have somehow already obtained approximate
values $y_0, y_1, \ldots, y_n$ of the solution y at the points $x_0, x_1, \ldots, x_n$, where
we again assume the points $x_m$ to be equally spaced, $x_m = a + mh$. The
approximate values of $f(x, y(x))$ at these points then are

   $f_m = f(x_m, y_m)$,   $m = 0, 1, \ldots, n$.

If $k \le n$, we can use these values to approximate $f(x, y(x))$ by the Newton
backward polynomial

   $P_k(x) = \sum_{m=0}^{k} (-1)^m \binom{-s}{m} \nabla^m f_n$,   $s = \frac{x - x_n}{h}$.

Performing the integration in (14-30) on $P_k$ in place of f, we obtain the
following algorithm for computing a new value $y_{n+1}$:


Algorithm 14.6 (The Adams-Bashforth method.) For a given
function f = f(x, y), a given constant h > 0, a given integer $k \ge 0$,
and given values $y_0, y_1, \ldots, y_k$, let $f_m = f(x_m, y_m)$ (m = 0, 1, 2, ..., k),
and generate the sequence $\{y_n\}$ recursively from the formulas

(14-31)   $y_{n+1} = y_n + h[c_0 f_n + c_1 \nabla f_n + c_2 \nabla^2 f_n + \cdots + c_k \nabla^k f_n]$

and

   $f_{n+1} = f(x_{n+1}, y_{n+1}) = \nabla^0 f_{n+1}$,
   $\nabla^m f_{n+1} = \nabla^{m-1} f_{n+1} - \nabla^{m-1} f_n$,   $m = 1, 2, \ldots, k$,

where n = k, k + 1, ....

Here the coefficients $c_m$ are defined by (13-5); for numerical values see
table 13.4b. The working of algorithm 14.6 is illustrated by scheme 14.6
for k = 3.








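   $x_0$   $y_0$   $f_0$
                           $\nabla f_1$
   $x_1$   $y_1$   $f_1$              $\nabla^2 f_2$
                           $\nabla f_2$              $\nabla^3 f_3$
   $x_2$   $y_2$   $f_2$              $\nabla^2 f_3$
                           $\nabla f_3$
   $x_3$   $y_3$   $f_3$
   $x_4$   $y_4$   $f_4$

Scheme 14.6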


As soon as $y_{n+1}$ is known, $f_{n+1}$ can be calculated and a new diagonal of
differences may be formed. The process requires exactly one evaluation 
of f per step, no matter how many differences are carried. This compares 
favorably with methods such as the classical Runge-Kutta method, which 
requires four evaluations. 

Algorithm 14.6 does not say how to obtain the starting values $y_0, y_1, \ldots,
y_k$. This can be done by any of the methods discussed in §14.3 or
§14.5; other starting methods, more in the spirit of the algorithm itself, 
are also known (see Collatz [1960], p. 81). 
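Algorithm 14.6 can be sketched as below. Two assumptions are made that the text does not spell out here: the coefficients are taken to be the standard Adams-Bashforth values $c_0, \ldots, c_3 = 1, \tfrac12, \tfrac{5}{12}, \tfrac38$ (presumed to agree with (13-5) and table 13.4b, which are not reproduced in this section), and the starting values are supplied by the caller. For brevity the sketch recomputes the backward differences from the stored values of f at every step rather than updating one diagonal, which changes the bookkeeping but not the result; f is still evaluated once per step.

```python
import math

# Assumed standard Adams-Bashforth coefficients c_0, ..., c_3.
C = [1.0, 1 / 2, 5 / 12, 3 / 8]

def backward_differences(fs, k):
    """Return [f_n, nabla f_n, ..., nabla^k f_n] from the last k+1 entries of fs."""
    col = list(fs[-(k + 1):])
    out = [col[-1]]
    for _ in range(k):
        col = [b - a for a, b in zip(col, col[1:])]
        out.append(col[-1])
    return out

def adams_bashforth(f, x0, h, start, k, n_end):
    """Return [y_0, ..., y_{n_end}] given starting values start = [y_0, ..., y_k], k <= 3."""
    ys = list(start)
    fs = [f(x0 + m * h, y) for m, y in enumerate(ys)]
    for n in range(k, n_end):
        d = backward_differences(fs, k)
        y_next = ys[-1] + h * sum(C[m] * d[m] for m in range(k + 1))   # (14-31)
        ys.append(y_next)
        fs.append(f(x0 + (n + 1) * h, y_next))   # the single new evaluation of f
    return ys

# y' = -y, y(0) = 1, k = 1, with the exact value e^{-h} used as second starting value.
h = 0.1
ys = adams_bashforth(lambda x, y: -y, 0.0, h, [1.0, math.exp(-h)], k=1, n_end=10)
print(ys[10] - math.exp(-1.0))
```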

Let us now study the error $z_n = y_n - y(x_n)$ of the values defined by
(14-31) under the assumption that the starting values are in error by at
most $\delta$:

(14-32)   $|z_m| = |y_m - y(x_m)| \le \delta$,   $m = 0, 1, \ldots, k$.

Expressing differences in terms of ordinates, (14-31) appears in the form

   $y_{n+1} = y_n + h\{b_{k0} f_n + b_{k1} f_{n-1} + \cdots + b_{kk} f_{n-k}\}$,

where the $b_{km}$ are certain constants, involving the $c_m$, that need not be
specified. The corresponding relation for the exact solution is, by (13-6),

   $y(x_{n+1}) = y(x_n) + h\{b_{k0}\, y'(x_n) + b_{k1}\, y'(x_{n-1}) + \cdots + b_{kk}\, y'(x_{n-k})\} + c_{k+1} h^{k+2} y^{(k+2)}(\xi_n)$,

where $\xi_n$ is a point between $x_{n-k}$ and $x_{n+1}$. We now subtract the last
two relations from each other. Observing that

   $|f_m - y'(x_m)| = |f(x_m, y_m) - f(x_m, y(x_m))| \le L|y_m - y(x_m)| = L|z_m|$,

where L denotes the Lipschitz constant of the function f, and assuming
that

(14-33)   $|y^{(k+2)}(x)| \le Y_{k+2}$,   $a \le x \le b$,

we obtain

   $|z_{n+1}| \le |z_n| + hL\{|b_{k0}|\,|z_n| + |b_{k1}|\,|z_{n-1}| + \cdots + |b_{kk}|\,|z_{n-k}|\} + |c_{k+1}|\, h^{k+2}\, Y_{k+2}$.


Our aim is to find an explicit bound for the quantities $|z_n|$. An induction
argument shows that

   $|z_n| \le w_n$,   $n = 0, 1, 2, \ldots$,

where $\{w_n\}$ is any solution of the difference equation

(14-34)   $w_{n+1} = w_n + hL\{|b_{k0}| w_n + |b_{k1}| w_{n-1} + \cdots + |b_{kk}| w_{n-k}\} + |c_{k+1}|\, h^{k+2}\, Y_{k+2}$

satisfying

(14-35)   $w_n \ge \delta$,   $n = 0, 1, \ldots, k$.

We try to find such a solution using the principles set forth in §6.7.
Since the nonhomogeneous term in (14-34) is a constant, the equation
clearly has the particular solution $w_n = -C$, where

(14-36)   $C = \frac{|c_{k+1}|\, h^{k+1}\, Y_{k+2}}{L B_k}$,

(14-37)   $B_k = |b_{k0}| + |b_{k1}| + \cdots + |b_{kk}|$.

In order to obtain a solution satisfying (14-35), we add to this solution a
suitable solution of the homogeneous equation

   $w_{n+1} = w_n + hL\{|b_{k0}| w_n + |b_{k1}| w_{n-1} + \cdots + |b_{kk}| w_{n-k}\}$.

The characteristic polynomial of the equation is

   $p(z) = z^{k+1} - z^k - hL\{|b_{k0}| z^k + \cdots + |b_{kk}|\}$.

Evidently, $p(1) \le 0$. On the other hand, if $z = 1 + hLB_k$, where $B_k$ is
given by (14-37), then

   $z^{-k} p(z) = 1 + hLB_k - 1 - hL\{|b_{k0}| + \cdots + |b_{kk}| z^{-k}\} \ge 0$.

It follows that for some $z^*$ such that $1 \le z^* \le 1 + hLB_k$,

   $p(z^*) = 0$.

Thus a solution of (14-34) satisfying (14-35) is given by

   $w_n = (\delta + C)\, (z^*)^n - C$.

In view of

   $(z^*)^n \le (1 + hLB_k)^n \le e^{(x_n - a) L B_k}$

we finally find, remembering the definition of C,

(14-38)   $|z_n| \le \delta\, e^{(x_n - a) L B_k} + |c_{k+1}|\, h^{k+1}\, Y_{k+2}\, \frac{e^{(x_n - a) L B_k} - 1}{L B_k}$,   $a \le x_n \le b$.


In summary, we have 


Theorem 14.6 If the starting values of the Adams-Bashforth method
are in error by at most $\delta$, then the errors $z_n = y_n - y(x_n)$ are bounded
by (14-38), where $Y_{k+2}$ and $B_k$ are defined by (14-33) and (14-37).

As on earlier occasions, the estimate (14-38) should be looked at as a
qualitative rather than a quantitative statement. The important thing is
that, if the starting errors are sufficiently small (namely at most $O(h^{k+1})$),
the error of the Adams-Bashforth algorithm tends to zero like $h^{k+1}$ as
$h \to 0$. Interpolating f in (14-30) by an interpolating polynomial of
degree k has the same effect as expanding the solution in a Taylor
polynomial of degree k + 1. However, the coefficient of $Y_{k+2}$ in (14-38) is
markedly greater than the corresponding coefficient of $Y_{p+1}$ in (14-15),
on account of the fact that both

   $|c_{k+1}| > \frac{1}{(k+2)!}$   and   $B_k > 1$

for k > 0.

Problems 


14*. Assuming that the starting values are exact, show that the errors $z_n$ of the
Adams-Bashforth method satisfy

   $z_n = z(x_n)\, h^{k+1} + O(h^{k+2})$,

where z denotes the solution of the initial value problem

   $z' = f_y(x, y(x))\, z - c_{k+1}\, y^{(k+2)}(x)$,   $z(a) = 0$.

[Analog of theorem 14.4.]

15. Solve problem 10 by the Adams-Bashforth method with k = 2, determining
the starting values $y_1$ and $y_2$ by Runge-Kutta or by a series
expansion around x = 0. Use the steps h = 0.2 and h = 0.1 and
extrapolate to the limit at x = 0.8.


14.7 Methods Based on Numerical Integration: The Adams-Moulton Method


A considerable refinement and improvement of the Adams-Bashforth
method can be obtained as follows: Suppose a tentative value of $y_{n+1}$
has been obtained, for instance by the Adams-Bashforth method. We shall
call this value $y_{n+1}^{(0)}$. Setting $f_{n+1}^{(0)} = f(x_{n+1}, y_{n+1}^{(0)})$, we can construct the
polynomial which at the points $x_{n-k}, x_{n-k+1}, \ldots, x_n, x_{n+1}$ takes on the
values $f_{n-k}, f_{n-k+1}, \ldots, f_n, f_{n+1}^{(0)}$. According to the Newton backward
formula, this polynomial is given by

   $P_{k+1}(x) = \sum_{m=0}^{k+1} (-1)^m \binom{1-s}{m} \nabla^m f_{n+1}^{(0)}$,

where again $s = (x - x_n)/h$, and where the differences are formed with
the values $f_{n+1}^{(0)}, f_n, \ldots, f_{n-k}$. Integrating between $x_n$ and $x_{n+1}$, we
obtain a new value of $y_{n+1}$, to be called $y_{n+1}^{(1)}$, according to the formula

(14-39)   $y_{n+1}^{(1)} = y_n + h[c_0^* f_{n+1}^{(0)} + c_1^* \nabla f_{n+1}^{(0)} + c_2^* \nabla^2 f_{n+1}^{(0)} + \cdots + c_{k+1}^* \nabla^{k+1} f_{n+1}^{(0)}]$.

Here the coefficients $c_m^*$ are defined by (13-8) and are tabulated in table
13.4a.

In general, the value $y_{n+1}^{(1)}$ calculated by (14-39) will disagree with
$y_{n+1}^{(0)}$. We therefore compute $f_{n+1}^{(1)} = f(x_{n+1}, y_{n+1}^{(1)})$, correct the difference
table of the values of f, and reevaluate the term on the right in (14-39). A
new value $y_{n+1}^{(2)}$ will result. The general step of this iteration procedure is
described by the formula

(14-40)   $y_{n+1}^{(i+1)} = y_n + h[c_0^* f_{n+1}^{(i)} + c_1^* \nabla f_{n+1}^{(i)} + \cdots + c_{k+1}^* \nabla^{k+1} f_{n+1}^{(i)}]$,

the differences on the right being formed with $f_{n+1}^{(i)} = f(x_{n+1}, y_{n+1}^{(i)})$,
$f_n, \ldots, f_{n-k}$. Theoretically, the iteration with respect to i should be
continued indefinitely, and the limit as $i \to \infty$ of the sequence $y_{n+1}^{(i)}$ would
be accepted as the final value of $y_{n+1}$. In practice, of course, the iteration
continues only to the point where the sequence $y_{n+1}^{(i)}$ has converged
numerically in the sense that an inequality

   $|y_{n+1}^{(i+1)} - y_{n+1}^{(i)}| < \epsilon$

holds, where $\epsilon$ is some preassigned number. More frequently even, the
iteration is performed only a fixed number of times, say for i = 0, 1, 2, ...,
I, where I is a preassigned integer such as 2 or 3. We state the last version
of the procedure as a formalized algorithm.


Algorithm 14.7 (The Adams-Moulton method.) For a given function
f = f(x, y), a given constant h > 0, given integers k > 0 and
I > 0, and given values $y_0, y_1, \ldots, y_k$, let $x_m = x_0 + mh$,
$f_m = f(x_m, y_m)$ (m = 0, 1, ..., k), and generate the sequence $y_n$ by the
following set of formulas: Form $\nabla f_k, \nabla^2 f_k, \ldots, \nabla^k f_k$, and calculate
for n = k, k + 1, ...

   $x_{n+1} = x_n + h$,

(14-31)   $y_{n+1}^{(0)} = y_n + h[c_0 f_n + c_1 \nabla f_n + \cdots + c_k \nabla^k f_n]$

and

   $y_{n+1} = y_{n+1}^{(I)}$,

where for i = 0, 1, ..., I - 1

   $f_{n+1}^{(i)} = f(x_{n+1}, y_{n+1}^{(i)})$,
   $\nabla^m f_{n+1}^{(i)} = \nabla^{m-1} f_{n+1}^{(i)} - \nabla^{m-1} f_n$   (m = 1, 2, ..., k + 1)

and

(14-40)   $y_{n+1}^{(i+1)} = y_n + h[c_0^* f_{n+1}^{(i)} + c_1^* \nabla f_{n+1}^{(i)} + \cdots + c_{k+1}^* \nabla^{k+1} f_{n+1}^{(i)}]$.

The reader will observe that this algorithm involves three iterations:
the "outer" iteration for calculating the sequence $\{y_n\}$, an inner iteration
for calculating the sequence of values $\{y_{n+1}^{(i)}\}$ for each n, and a further
inner iteration for calculating the differences $\nabla^m f_{n+1}^{(i)}$ for each i. (Problem
19 shows that this last iteration can be avoided, however.) Since the
tentative values $y_{n+1}^{(i)}$ are "corrected" at each application of formula
(14-40), this formula is called a corrector formula. The formula which
produces a first approximation $y_{n+1}^{(0)}$ is called a predictor formula. The
Adams-Moulton method thus belongs to the class of so-called
predictor-corrector procedures. Like the Runge-Kutta method, it is one of the
most reliable and most widely used procedures for the numerical integration
of ordinary differential equations.
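A compact sketch of the predictor-corrector procedure of algorithm 14.7 follows. The coefficient lists are assumptions: they are the standard Adams-Bashforth values $c_m$ and Adams-Moulton values $c_m^*$, presumed to agree with tables 13.4b and 13.4a, which are not reproduced in this section. The parameter `n_iter` plays the role of I, and the differences are recomputed from stored values of f instead of being updated diagonal by diagonal.

```python
import math

C_AB = [1.0, 1 / 2, 5 / 12, 3 / 8]                 # assumed predictor coefficients c_m
C_AM = [1.0, -1 / 2, -1 / 12, -1 / 24, -19 / 720]  # assumed corrector coefficients c*_m

def backward_diffs(values, order):
    """Return [v_n, nabla v_n, ..., nabla^order v_n] for the last entry v_n of values."""
    col = list(values[-(order + 1):])
    out = [col[-1]]
    for _ in range(order):
        col = [b - a for a, b in zip(col, col[1:])]
        out.append(col[-1])
    return out

def adams_moulton(f, x0, h, start, k, n_iter, n_end):
    """Return [y_0, ..., y_{n_end}] from start = [y_0, ..., y_k], with k <= 3."""
    ys = list(start)
    fs = [f(x0 + m * h, y) for m, y in enumerate(ys)]
    for n in range(k, n_end):
        x_next = x0 + (n + 1) * h
        d = backward_diffs(fs, k)
        y_new = ys[-1] + h * sum(C_AB[m] * d[m] for m in range(k + 1))   # predictor (14-31)
        for _ in range(n_iter):                                          # corrector (14-40)
            d_star = backward_diffs(fs + [f(x_next, y_new)], k + 1)
            y_new = ys[-1] + h * sum(C_AM[m] * d_star[m] for m in range(k + 2))
        ys.append(y_new)
        fs.append(f(x_next, y_new))
    return ys

h = 0.1
start = [math.exp(-m * h) for m in range(3)]       # exact starting values (an assumption)
ys = adams_moulton(lambda x, y: -y, 0.0, h, start, k=2, n_iter=2, n_end=10)
print(ys[10] - math.exp(-1.0))
```

With `n_iter` fixed at a small value the cost per step is `n_iter + 1` evaluations of f in this sketch, the last one being needed to store $f_{n+1}$ for the accepted value.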

Let us convince ourselves that the sequence $\{y_{n+1}^{(i)}\}$ generated by repeated
application of the corrector formula would converge for $i \to \infty$. For
the purpose of analysis, we express the differences in (14-40) in terms of
ordinates and observe that only the foremost values $f_{n+1}^{(i)}$ involve $y_{n+1}^{(i)}$.
Setting

   $y_{n+1}^{(i)} = w_i$,   $i = 0, 1, 2, \ldots$,

equation (14-40) is of the form

(14-41)   $w_{i+1} = F(w_i)$,

where

   $F(w) = h(c_0^* + c_1^* + \cdots + c_{k+1}^*)\, f(x_{n+1}, w) + C$,

C being a constant independent of w. According to theorem 4.2 the
sequence $\{w_i\}$ defined by (14-41) converges, for any choice of the starting
value $w_0$, if there exists a constant K < 1 such that, for any two real
numbers y and z,

(14-42)   $|F(z) - F(y)| \le K|z - y|$.

Problem 8 of chapter 13 shows that

(14-43)   $c_0^* + c_1^* + \cdots + c_{k+1}^* = c_{k+1}$,

where $c_{k+1}$ is defined by (13-5). Hence if f satisfies a Lipschitz condition
with Lipschitz constant L,

   $|F(z) - F(y)| = h|c_{k+1}|\, |f(x_{n+1}, z) - f(x_{n+1}, y)| \le h|c_{k+1}|\, L\, |z - y|$.

Thus (14-42) is shown to hold with

   $K = h|c_{k+1}|L$,

and this is less than one whenever

(14-44)   $h < \frac{1}{|c_{k+1}|\, L}$.

We thus have obtained:

Theorem 14.7 The inner iteration necessary to obtain each new
value $y_{n+1}$ in the Adams-Moulton method converges for any choice
of the predicted value $y_{n+1}^{(0)}$ whenever the step h satisfies (14-44).


By theorem 4.2, the quantity

   $y_{n+1} = \lim_{i \to \infty} y_{n+1}^{(i)}$

is a solution of the equation

(14-45)   $y_{n+1} = y_n + h[c_0^* f_{n+1} + c_1^* \nabla f_{n+1} + \cdots + c_{k+1}^* \nabla^{k+1} f_{n+1}]$.

Theoretically, we could thus define the Adams-Moulton values $y_{n+1}$ as
the solution of the non-linear difference equation (14-45). The error
$z_n = y_n - y(x_n)$ of these values can be investigated by a method entirely
analogous to that used in the proof of theorem 14.6. Expressing
differences in terms of ordinates, we write (14-45) in the form

   $y_{n+1} = y_n + h\{b_{k+1,0}^* f_{n+1} + b_{k+1,1}^* f_n + \cdots + b_{k+1,k+1}^* f_{n-k}\}$.

Setting

(14-46)   $B_{k+1}^* = |b_{k+1,0}^*| + |b_{k+1,1}^*| + \cdots + |b_{k+1,k+1}^*|$,

and assuming that the starting values satisfy (14-32), we can show that


1 


ss Se Se 
(14-47) |z,| S$ 7— Wocall 


{bets —@)Bh4 4b 


(x, —-a@)B?. Lk 

en k+il — | 
* k+2 

+ |ch2|h**? Yuus ae 


Bis 


for $a \le x_n \le b$. Here L again denotes a Lipschitz constant for the
function f. If the starting values are sufficiently accurate, then the error
bound (14-47) is considerably better than the corresponding bound
(14-38) for the Adams-Bashforth method using equally many points,
because $h^{k+1}$ is now replaced by $h^{k+2}$, and $|c_{k+2}^*| < |c_{k+1}|$, $B_{k+1}^* < B_k$.

Similar bounds can also be obtained for the more practical case where
the inner iteration is only performed a finite number I of times, as indicated
in algorithm 14.7. It can be shown that the above qualitative conclusions
still hold when I = 2.


Problems 


16. Determine the constants $b_{k,m}$ and $b_{k+1,m}^*$ for k = 0, 1, 2, 3 and calculate
the values of the constants $B_k$ and $B_{k+1}^*$.

17. Repeat problem 15, using the Adams-Moulton procedure with k = 2.

18. Give a proof of (14-47) along the lines of the proof of theorem 14.6.

19. Show that, by virtue of (14-43), the values $y_{n+1}^{(i)}$ defined in algorithm 14.7
can be calculated more simply from the relation

   $y_{n+1}^{(i+1)} = y_{n+1}^{(i)} + h c_{k+1} [f(x_{n+1}, y_{n+1}^{(i)}) - f(x_{n+1}, y_{n+1}^{(i-1)})]$,   $(i \ge 1)$.


14.8 Numerical Stability 


In order to bring the discussions of the two preceding sections to a
more concrete level, we now shall determine explicitly the numerical
solutions of the initial value problem

(14-48)   $y' = Ay$,   $y(0) = 1$

(exact solution: $y(x) = e^{Ax}$) by several methods based on numerical
integration.

We begin with the Adams-Bashforth method (algorithm 14.6) for k = 1.
Since $c_0 = 1$, $c_1 = \tfrac{1}{2}$, (14-31) yields the recurrence relation

   $y_{n+1} = y_n + h(f_n + \tfrac{1}{2} \nabla f_n) = y_n + h(\tfrac{3}{2} f_n - \tfrac{1}{2} f_{n-1})$

or, since in the present case $f_n = f(x_n, y_n) = A y_n$,

(14-49)   $y_{n+1} = y_n + Ah(\tfrac{3}{2} y_n - \tfrac{1}{2} y_{n-1})$.

This is a linear difference equation of order two with constant coefficients.
Its characteristic polynomial

   $p(z) = z^2 - (1 + \tfrac{3}{2} Ah) z + \tfrac{1}{2} Ah$

has the zeros

   $z_1 = \tfrac{1}{2} + \tfrac{3}{4} Ah + \sqrt{\tfrac{1}{4} + \tfrac{1}{4} Ah + \tfrac{9}{16} (Ah)^2}$

and, by Vieta,

   $z_2 = \frac{Ah}{2 z_1}$.

These zeros are certainly distinct for sufficiently small values of h. By
§6.3 the general solution is thus given by

   $y_n = c_1 z_1^n + c_2 z_2^n$,

where $c_1$ and $c_2$ are two arbitrary constants. Determining these constants
in such a manner that the solution satisfies $y_0 = 1$, we obtain

(14-50)   $y_n = (1 - c_2) z_1^n + c_2 z_2^n$,

where $c_2$ is a function of the as yet unspecified value $y_1$,

(14-51)   $c_2 = \frac{y_1 - z_1}{z_2 - z_1}$.

As in §14.3, we now shall study the behavior of $y_n$ as $h \to 0$ and $n \to \infty$
while nh = x remains fixed. We begin by expressing $z_1^n$ in such a way
that this behavior becomes evident. Expanding the square root in powers
of Ah (see Taylor [1959], §15.10) we find

   $z_1 = 1 + Ah + \tfrac{1}{2}(Ah)^2 - \tfrac{1}{4}(Ah)^3 + O(h^4)$.

By virtue of

   $e^{Ah} = 1 + Ah + \tfrac{1}{2}(Ah)^2 + \tfrac{1}{6}(Ah)^3 + O(h^4)$

this may be written

(14-52)   $z_1 = e^{Ah} - \tfrac{5}{12}(Ah)^3 + O(h^4)$

or, since $e^{Ah} = 1 + O(h)$,

   $z_1 = e^{Ah}[1 - \tfrac{5}{12}(Ah)^3 + O(h^4)]$.

By virtue of nh = x it follows that

   $z_1^n = e^{Ax}[1 - \tfrac{5}{12} x A^3 h^2 + O(h^3)]$.

Since $z_2 = O(h)$, $z_2^n = O(h^n)$ is very small compared with $z_1^n$. We thus
obtain from (14-50), neglecting less important terms,

(14-53)   $y_n \approx e^{Ax} - \tfrac{5}{12} A^3 x e^{Ax} h^2 - c_2 e^{Ax}$.

Here the first term is clearly the desired exact solution. The second term
arises as a result of approximating the integral in (13-6) by a discrete
formula. In the general case it would be

   $-c_{k+1} A^{k+2} x h^{k+1} e^{Ax}$.

The third term is a function of the choice of our second starting value $y_1$.
It would be zero if we succeeded in choosing $y_1 = z_1$. It is small but
different from zero if we take $y_1 = e^{Ah}$, the value of the exact solution
at $x_1$. The significant fact is that both errors, the one due to discretization
as well as the starting error, contain the factor $e^{Ax}$. This is particularly
important when A is negative, e.g., when A = -1. In this case, the
errors do not grow indefinitely; the discretization error, for instance,
grows up to x = 1 and then tends to zero, like the solution itself.

Let us now examine the integration of the same problem by a different 
integration formula. As the error of the interpolating polynomial is 
smallest near the median interpolating point, one is always tempted to 
use symmetric central difference formulas for numerical integration. To 
consider the simplest case, let us integrate the identity

   $y'(x) = f(x, y(x))$

between the limits $x_{n-1}$ and $x_{n+1}$, using the midpoint rule (see problem 4,
chapter 13) to approximate the integral. There results the formula

(14-54)   $y_{n+1} = y_{n-1} + 2h f_n$.

This formula could be used, in principle, much like the Adams-Bashforth
formula. For the special initial value problem (14-48) we obtain the
difference equation

   $y_{n+1} = 2Ah\, y_n + y_{n-1}$.

Its characteristic polynomial

   $p(z) = z^2 - 2Ahz - 1$

has the two distinct zeros

   $z_1 = Ah + \sqrt{1 + (Ah)^2}$

and

   $z_2 = -\frac{1}{z_1}$.

All solutions satisfying $y_0 = 1$ can thus again be represented in the form
(14-50).

For the purpose of exploring the asymptotic behavior of the solution $y_n$
as $h \to 0$ while nh = x remains fixed, we again expand $z_1$ in powers of h.
Using the binomial formula (see Taylor [1959], p. 479) we obtain

   $z_1 = 1 + Ah + \tfrac{1}{2}(Ah)^2 + O(h^4)$

or, by comparison with the exponential series,

   $z_1 = e^{Ah} - \tfrac{1}{6}(Ah)^3 + O(h^4)$.

Proceeding as above, we thus find

   $z_1^n = e^{Ax}[1 - \tfrac{1}{6} x A^3 h^2 + O(h^3)]$

and

   $z_2^n = (-z_1)^{-n} = (-1)^n e^{-Ax}[1 + \tfrac{1}{6} x A^3 h^2 + O(h^3)]$.

It is seen that $z_1^n$ and $z_2^n$ have the same order of magnitude, and that the
term $c_2 z_2^n$ can now no longer be neglected. We thus find from (14-50), up
to less significant terms,

(14-55)   $y_n \approx e^{Ax} - \tfrac{1}{6} A^3 x e^{Ax} h^2 - c_2 e^{Ax} + c_2 (-1)^n e^{-Ax}$.

In the leading term we again recognize the desired solution of the
continuous problem. The second term represents the genuine discretization
error, due to the approximation of an integral by a discrete formula.
This term is smaller than the corresponding term in (14-53), emphasizing
the greater accuracy of central difference formulas. The last two terms
are "starting" errors; they owe their presence to the fact that $y_1 \ne z_1$ in
general. In particular both terms are present if we take $y_1 = e^{Ah}$, the
exact value. These starting errors would not be of much concern if it
were not for the fact that the second term, $(-1)^n c_2 e^{-Ax}$, shows a behavior
whose character is exactly opposite to that of the exact solution. If A is
negative, this term will grow at an exponential rate and, no matter how
small initially, overshadow all other terms if x is sufficiently large.
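The root expansions used above can be checked numerically. The sketch below evaluates the zeros of both characteristic polynomials for A = -1 and h = 0.1 (values chosen here only for illustration) and compares $z_1 - e^{Ah}$ with the leading terms $-\tfrac{5}{12}(Ah)^3$ and $-\tfrac{1}{6}(Ah)^3$; the function names are hypothetical.

```python
import math

def ab_roots(A, h):
    """Zeros z_1, z_2 of z^2 - (1 + 3/2 Ah) z + 1/2 Ah (Adams-Bashforth, k = 1)."""
    u = A * h
    z1 = 0.5 + 0.75 * u + math.sqrt(0.25 + 0.25 * u + 0.5625 * u * u)
    return z1, u / (2.0 * z1)            # z_2 from Vieta: z_1 * z_2 = Ah/2

def midpoint_roots(A, h):
    """Zeros z_1, z_2 of z^2 - 2 Ah z - 1 (midpoint formula)."""
    u = A * h
    z1 = u + math.sqrt(1.0 + u * u)
    return z1, -1.0 / z1                 # z_2 from Vieta: z_1 * z_2 = -1

A, h = -1.0, 0.1
z1_ab, z2_ab = ab_roots(A, h)
z1_mp, z2_mp = midpoint_roots(A, h)
print(z1_ab - math.exp(A * h), -5.0 / 12.0 * (A * h) ** 3)
print(z1_mp - math.exp(A * h), -1.0 / 6.0 * (A * h) ** 3)
print(abs(z2_ab), abs(z2_mp))            # parasitic roots: small versus larger than one
```

The last line shows the qualitative difference between the two formulas: for negative A the second root of the midpoint formula has modulus greater than one, so its powers grow, whereas the second Adams-Bashforth root is of order h.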


EXAMPLE 
8. Table 14.8 shows some values of the errors $z_n = y_n - e^{-x_n}$ of the
numerical approximations to the solution of $y' = -y$, $y(0) = 1$ obtained
by the formulas (14-49) and (14-54) with the step h = 0.1.

While for small values of x the Adams-Bashforth method has the 
larger errors (due to the fact that the discretization error is larger) we have 
at x = 5 reached a point where the error of the midpoint formula is 
larger by a factor 100, due to the exponentially growing oscillatory 
component in (14-55). 

Table 14.8

   $x_n$     $10^5 z_n$     $10^5 z_n$
            Adams        Midpoint

   0.0         0             0
   0.1         0             0
   0.2        38            30
   0.3        70            21
   0.4        90            51

   5.0        15          1129
   5.1        14         -1385
   5.2        12          1406
   5.3        11         -1665
   5.4        11          1739

The presence of oscillatory error terms that grow relative to the exact
solution represents a special kind of numerical instability that is typical
for a number of formulas for integrating differential equations. This
phenomenon of numerical instability can be traced to the fact that the
characteristic polynomial of the linear difference equation arising from
integrating $y' = Ay$ has several zeros of approximate modulus one. This,
in turn, will always be the case if the identity $y'(x) = f(x, y(x))$ is integrated
over an interval whose length exceeds one integration step.
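The experiment of example 8 can be repeated with a short program. The sketch below applies (14-49) and (14-54) to $y' = -y$, $y(0) = 1$ with h = 0.1, taking the exact value $e^{-h}$ as second starting value; the function names are illustrative, and the printed figures may differ from the tabulated ones in the last digit because of the rounding used when the table was originally computed.

```python
import math

def adams_bashforth_k1(A, h, n_steps):
    """Values y_0..y_n from (14-49) with y_0 = 1, y_1 = exp(A*h)."""
    ys = [1.0, math.exp(A * h)]
    for n in range(1, n_steps):
        ys.append(ys[n] + A * h * (1.5 * ys[n] - 0.5 * ys[n - 1]))
    return ys

def midpoint(A, h, n_steps):
    """Values y_0..y_n from (14-54), i.e. y_{n+1} = y_{n-1} + 2*A*h*y_n."""
    ys = [1.0, math.exp(A * h)]
    for n in range(1, n_steps):
        ys.append(ys[n - 1] + 2.0 * A * h * ys[n])
    return ys

A, h, N = -1.0, 0.1, 54
ab = adams_bashforth_k1(A, h, N)
mp = midpoint(A, h, N)
for n in (0, 1, 2, 3, 4, 50, 51, 52, 53, 54):
    exact = math.exp(A * n * h)
    print(f"{n * h:4.1f} {1e5 * (ab[n] - exact):10.1f} {1e5 * (mp[n] - exact):10.1f}")
```

The midpoint errors alternate in sign and grow with x, while the Adams-Bashforth errors decay together with the solution, in line with (14-53) and (14-55).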


Problems 


20. Carry out the above investigation for the integration algorithm based on
Simpson's formula

   $y_{n+1} = y_{n-1} + h(\tfrac{1}{3} f_{n+1} + \tfrac{4}{3} f_n + \tfrac{1}{3} f_{n-1})$.

Show, in particular, that the starting error now has a component of the
form $(-1)^n c_2 e^{-Ax/3}$.

21. Integrating Bessel's interpolating polynomial between the limits $x_{-1}$ and
$x_2$, obtain the following formula for integration:

   $y_{n+2} = y_{n-1} + \tfrac{3}{8} h (f_{n+2} + 3 f_{n+1} + 3 f_n + f_{n-1})$.

22*. Investigate the stability of the formula obtained in the preceding problem.

23. In constructing table 14.8, $y_1$ was chosen as the exact value $e^{-h}$. Show
experimentally that, due to rounding errors, the phenomenon of numerical
instability occurs also for $y_1 = z_1 = -h + \sqrt{1 + h^2}$, where theoretically
$c_2 = 0$.


Recommended Reading 


A large variety of methods for integrating differential equations is 
discussed in Milne [1953]. Collatz [1960] contains a wealth of valuable
information. The phenomenon of numerical instability of algorithms
for the solution of initial value problems was first discussed by
Rutishauser [1952] in a brief but classical paper. A comprehensive treatment of
errors and numerical stability will be found in Henrici [1962, 1963]. For 
the numerical solution of boundary value problems and partial differential 
equations, which had to be ignored here for reasons of space, we refer to 
the excellent treatises by Fox [1957] and Forsythe and Wasow [1960]. 


Research Problem 


Study the effectiveness of repeated extrapolation to the limit in the 
numerical solution of differential equations. In particular, compare the 
accuracy obtained with repeated extrapolation to the limit in the Euler 
method with that given by the Runge-Kutta method and the Adams- 
Moulton method, using the same number of evaluations of f. 


PART THREE 


COMPUTATION 




chapter 15 number systems


All numbers that have occurred so far in the theoretical discussions in 
this book were real (or complex) numbers in the strict mathematical sense. 
That is, they were to be conceived as infinite decimal fractions, or as 
Dedekind cuts. For the purposes of computation such numbers have to 
be approximated by real numbers of a rather special type, such as 
terminating decimal fractions, or other rational numbers. The present 
chapter is devoted to a study of the number systems that can be used for 
the purposes of computation. 


15.1 Representation of Integers 


Let us take a look at our conventional number system. What do we 
mean by a symbol such as 247? Evidently 


    $247 = 2 \cdot 100 + 4 \cdot 10 + 7$
    $\phantom{247} = 2 \cdot 10^2 + 4 \cdot 10^1 + 7 \cdot 10^0$.


The number 247 is represented as a polynomial in the base 10, with 
integral coefficients between 0 and 9 = 10 — 1. 

There is no intrinsic reason why 10 should be used as a base; the 
number of fingers may have to do with it. There is evidence that in 
cultures different from ours other number systems have been used. The 
French word quatre-vingts for the number 80 indicates a system with 
base 20. (Maybe the French counted with their toes as well as with their 
fingers.) In New Zealand, words for $11^2$ and $11^3$ have been found. The
Babylonian astronomers used a sexagesimal system, i.e., a system with the 
base 60. A trace of this can be found in our dividing the circumference of 
the circle into 360 degrees. Also mixed systems, although mathematically 
much less satisfying, are in use, such as the Anglo-Saxon system for 
measuring length, and the English monetary system.

In electronic computation, the digits of an integer are represented by
various states of a physical quantity, such as an electric current. The
technically simplest situation arises when there are only two states to be
represented, such as the state "no current" and the state "a unit current."
For this reason, modern electronic computers work internally almost 
exclusively with the base 2. The resulting number system is called the 
binary number system. In this system, only the digits 0 and 1 occur. In 
order to distinguish them from decimal digits, we shall underline them. 
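Thus, if a given nonnegative integer $N$ is in the binary system represented
in the form

(15-1)    $N = a_n 2^n + a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0$

where the $a_k$ are either zero or one, it will be written in the form
$\underline{a_n a_{n-1} \ldots a_1 a_0}$.

EXAMPLE

1. $1 = \underline{1}$, $2 = \underline{10}$, $3 = \underline{11}$, $\underline{101} = 5$, $8 = \underline{1000}$, $\underline{1010} = 10$.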


If we wish to communicate with a computer working in the binary 
system, we (or the computer) must be able to convert a number from the 
decimal to the binary system and conversely. To convert from binary 
to decimal, we regard the number $N$ given by (15-1) as the value of the
polynomial

    $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$

for $x = 2$. To evaluate $P(2)$, we may use algorithm 3.4. (Note that the
coefficients of the polynomial are numbered differently now.) It follows
that if we calculate the numbers $b_k$ recursively by

(15-2)    $b_0 = a_n$,   $b_k = a_{n-k} + 2b_{k-1}$   $(k = 1, 2, \ldots, n)$,

then $b_n = P(2) = N$.


EXAMPLE

2. To express the number $N = \underline{11111001111}$ in decimal. The Horner
scheme yields

    k         0   1   2   3    4    5    6     7     8     9     10
    a_{n-k}   1   1   1   1    1    0    0     1     1     1     1
    b_k       1   3   7   15   31   62   124   249   499   999   1999

It follows that $N = 1999$.
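The recurrence (15-2) can be sketched in a few lines of Python. This is only an illustration: the function names `binary_to_decimal` and `horner_steps` are hypothetical, and the binary number is assumed to be given as a string of the characters 0 and 1, most significant digit first.

```python
def horner_steps(bits: str) -> list[int]:
    """Return b_0, b_1, ..., b_n from (15-2) for the binary digits a_n ... a_0."""
    if not bits or any(ch not in "01" for ch in bits):
        raise ValueError("expected a nonempty string of the digits 0 and 1")
    steps = []
    b = 0
    for ch in bits:              # a_n first, a_0 last
        b = int(ch) + 2 * b      # b_k = a_{n-k} + 2 * b_{k-1}; first pass gives b_0 = a_n
        steps.append(b)
    return steps


def binary_to_decimal(bits: str) -> int:
    """Value N = P(2) of the binary number, i.e. the last b_k."""
    return horner_steps(bits)[-1]


print(horner_steps("11111001111"))
print(binary_to_decimal("11111001111"))
```

Applied to the digits of example 2, `horner_steps` reproduces the row $b_k$ of the scheme.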


To convert a given integer from decimal to binary, we make use of the
fact that the last binary digit $a_0$ of an integer $N$ is zero if and only if $N$ is
even. The second binary digit $a_1$ is zero if and only if $(N - a_0)/2$ is even,
and so on. This leads to the following scheme:


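Algorithm 15.1 To find the binary representation (15-1) of a given
positive integer $N$ let

          $N_0 = N$,
(15-3)
          $N_{k+1} = \frac{N_k - a_k}{2}$,   $k = 0, 1, 2, \ldots$,

where

(15-4)    $a_k = \begin{cases} 1 & \text{if } N_k \text{ is odd}, \\ 0 & \text{if } N_k \text{ is even}. \end{cases}$

Continue until $N_k = 0$.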


EXAMPLE

3. To express $N = 1999$ in binary form. Algorithm 15.1 yields the
scheme

    k     0      1     2     3     4     5    6    7    8   9   10
    N_k   1999   999   499   249   124   62   31   15   7   3   1
    a_k   1      1     1     1     0     0    1    1    1   1   1

It follows that $1999 = \underline{11111001111}$. (Note that the least significant
digits are obtained first.) The scheme is an exact reversal of the scheme of
example 2.
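Algorithm 15.1 translates almost literally into code. The following Python sketch is an illustration only; the name `decimal_to_binary` is hypothetical, and the result is returned as a string with the most significant digit first.

```python
def decimal_to_binary(n: int) -> str:
    """Binary digits of a positive integer by the recurrence (15-3), (15-4)."""
    if n <= 0:
        raise ValueError("algorithm 15.1 is stated for positive integers")
    digits = []                  # a_0, a_1, ... (least significant first)
    while n != 0:                # continue until N_k = 0
        a = n % 2                # a_k = 1 if N_k is odd, 0 if N_k is even
        digits.append(str(a))
        n = (n - a) // 2         # N_{k+1} = (N_k - a_k) / 2
    return "".join(reversed(digits))


print(decimal_to_binary(1999))
```

Because the least significant digits are produced first, the list is reversed before it is written out.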


Problems 


1. Express the numbers 1685, 1770, 1882 in the binary system. 

2. What are the decimal values of the binary numbers 1'000, 1'000'000,
1'000'000'000?

3. What is, for $N \to \infty$, the ratio of the number of digits in the binary and
the decimal representation of an integer $N$?

4. State the rules for the conversion of a decimal integer $N$ into the ternary
system, and vice versa. Express $10^m$ in the ternary system ($m = 1, 2, 3$).

5. What is the representation of the positive integer N in the number system 
to the base N? 


15.2. Binary Fractions 


A binary fraction is a series of the form

(15-5)    $\sum_{k=1}^{\infty} a_{-k} 2^{-k}$

where the coefficients $a_{-1}, a_{-2}, \ldots$ are either zero or one. The series
(15-5) always converges, because it is majorized by the geometric series

    $\sum_{k=1}^{\infty} 2^{-k} = 1$.

The sum $z$ of (15-5) will also be denoted by

    $z = 0.a_{-1}a_{-2}a_{-3}\ldots$

The binary fraction (15-5) is said to terminate if, for some integer $n$,
$a_{-k} = 0$, $k > n$.
The following theorem is fundamental, but will not be proved:


Theorem 15.2a Any real number $z$ such that $0 < z \le 1$ can be
represented in a unique manner by a nonterminating binary fraction.


If we drop the condition that the binary fraction shall not terminate, then
the representation may not be unique; for instance, the binary fractions

    0.1   and   0.01111...

both represent the number 0.5.


A terminating binary fraction $z = 0.a_{-1}a_{-2}\ldots a_{-n}$ can be regarded as
the value of the polynomial

    $P(x) = a_{-1}x + a_{-2}x^2 + \cdots + a_{-n}x^n$

at $x = \frac{1}{2}$ and thus can be evaluated by algorithm 3.4 (Horner's scheme) as
follows: Let

    $b_0 = a_{-n}$,   $b_k = a_{-n+k} + \frac{1}{2}b_{k-1}$,   $k = 1, 2, \ldots, n$,

where $a_0 = 0$. Then $b_n = z$.


EXAMPLE

4. To express $z = \underline{0.00110011}$ in decimal. Horner's scheme yields

    k   a_{-n+k}   b_k
    0   1          1
    1   1          1.5
    2   0          0.75
    3   0          0.375
    4   1          1.1875
    5   1          1.59375
    6   0          0.796875
    7   0          0.3984375
    8   0          0.19921875

It follows that $z = 0.19921875$.

Another method for converting a terminating binary fraction consists in
converting the integer

    $2^n z = a_{-1}2^{n-1} + a_{-2}2^{n-2} + \cdots + a_{-n}$

and dividing the result by $2^n$.
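Both conversion methods for terminating binary fractions can be compared in a short Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, the digits after the binary point are passed as a string ($a_{-1}$ first), and exact rational arithmetic from the standard `fractions` module is used so that no rounding enters.

```python
from fractions import Fraction


def fraction_by_horner(bits: str) -> Fraction:
    """Horner's scheme at x = 1/2: b_0 = a_{-n}, b_k = a_{-n+k} + b_{k-1}/2, with a_0 = 0."""
    coeffs = [int(ch) for ch in reversed(bits)] + [0]   # a_{-n}, ..., a_{-1}, a_0
    b = Fraction(0)
    for a in coeffs:
        b = a + b / 2            # the first pass yields b_0 = a_{-n}
    return b                     # b_n = z


def fraction_by_integer(bits: str) -> Fraction:
    """Convert the integer 2^n z and divide the result by 2^n."""
    return Fraction(int(bits, 2), 2 ** len(bits))


z = fraction_by_horner("00110011")
print(z, float(z))
print(fraction_by_integer("00110011") == z)
```

For the digits of example 4 both routes give the same rational number, whose decimal expansion terminates.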

Except in special circumstances, non-terminating binary fractions
cannot be converted into terminating decimal fractions. To get an
approximate decimal representation, we truncate an infinite binary
fraction after the $m$th digit and convert the resulting terminating fraction.
The error in this approximation will be less than $2^{-m}$.

The inverse problem of converting a given (decimal) fraction into a 
binary fraction is solved by the following algorithm: 


Algorithm 15.2 For a real number $z$ such that $0 < z < 1$, calculate
the sequences $\{z_k\}$ and $\{a_{-k}\}$ recursively by the relations

          $z_1 = z$,

(15-6)    $a_{-k} = \begin{cases} 1, & \text{if } 2z_k > 1, \\ 0, & \text{if } 2z_k \le 1, \end{cases}$

          $z_{k+1} = 2z_k - a_{-k}$,   $k = 1, 2, \ldots$.

Theorem 15.2b For the sequence $\{a_{-k}\}$ defined by (15-6),

    $z = 0.a_{-1}a_{-2}a_{-3}\ldots$.

Proof. According to theorem 15.2a, $z$ has a nonterminating binary
representation of the form

    $z = 0.b_{-1}b_{-2}b_{-3}\ldots$;

we have to show that

(15-7)    $b_{-k} = a_{-k}$,   $k = 1, 2, \ldots$,

where $a_{-k}$ is defined by (15-6). Incidentally we shall also show that

(15-8)    $z_k = 0.b_{-k}b_{-k-1}b_{-k-2}\ldots$,   $k = 1, 2, \ldots$.

The simultaneous proof of the two assertions is by induction. Clearly,
(15-8) is true for $k = 1$. Let us assume that (15-7) and (15-8) are true for
the integers $k - 1$ and $k$, where $k \ge 1$. We then have

    $2z_k = b_{-k} + 0.b_{-k-1}b_{-k-2}\ldots$.

Since the binary fraction on the right is positive and bounded by one, it
follows that $2z_k > 1$, and hence $a_{-k} = 1$, if and only if $b_{-k} = 1$. This
establishes (15-7), and, by the second formula (15-6), (15-8) with $k$
increased by one.

EXAMPLE

5. To express $z = \frac{1}{5}$ as a binary fraction. We have

    k   z_k   2z_k   a_{-k}
    1   0.2   0.4    0
    2   0.4   0.8    0
    3   0.8   1.6    1
    4   0.6   1.2    1
    5   0.2

Since $z_5 = 0.2$ has occurred before, the periodic† binary fraction

    $0.2 = 0.001100110011\ldots = 0.\overline{0011}$

results. (The period is indicated by a cross-bar.)
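Algorithm 15.2 can be tried out with exact rational arithmetic, which also makes it easy to detect the moment a value $z_k$ recurs. The Python sketch below is an illustration only; the function name, the `max_digits` cutoff, and the convention of returning the position at which the period starts are assumptions made for this example.

```python
from fractions import Fraction


def binary_fraction_digits(z: Fraction, max_digits: int = 64):
    """Digits a_{-1}, a_{-2}, ... of the nonterminating binary fraction of z by (15-6).

    Returns (digits, start): the digits generated before some z_k repeated, and the
    0-based position where the period begins (None if no repeat was seen in time).
    """
    if not (0 < z <= 1):
        raise ValueError("expected 0 < z <= 1")
    seen = {}                         # value z_k -> position of the digit it produced
    digits = []
    zk = z
    for position in range(max_digits):
        if zk in seen:                # z_k has occurred before: the digits repeat
            return digits, seen[zk]
        seen[zk] = position
        a = 1 if 2 * zk > 1 else 0    # first formula of (15-6)
        digits.append(a)
        zk = 2 * zk - a               # second formula of (15-6)
    return digits, None


print(binary_fraction_digits(Fraction(1, 5)))
print(binary_fraction_digits(Fraction(1, 2)))
```

For $z = 1/5$ the four digits 0, 0, 1, 1 repeat from the start. For $z = 1/2$ the sketch yields the nonterminating form 0.0111..., in line with theorem 15.2a.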

It is now easy to represent any nonnegative number in the binary system.
If $z$ is such a number, let $[z]$ be the greatest integer not exceeding $z$.
According to §15.1 we can write

    $[z] = \sum_{k=0}^{n} a_k 2^k$,

where the $a_k$ are either zero or one. For the fractional part we have, by
(15-5)

    $z - [z] = \sum_{k=1}^{\infty} a_{-k} 2^{-k}$

with the same limitation on the $a_{-k}$. Thus

    $z = [z] + (z - [z]) = \sum_{k=-n}^{\infty} a_{-k} 2^{-k} = a_n a_{n-1} \ldots a_0 . a_{-1} a_{-2} \ldots$

To convert a binary number into a decimal or vice versa, we must convert
the integral and the fractional part by themselves.


Problems 


6. Give the binary value of $\pi$, correct to 12 binary digits after the binary
point. 

7. Express the binary numbers (a) 0.10101, (b) 0.10101 as ratios of two
integers.

8. Determine the representation of 4 in the number systems to the base
(a) 2, (b) 5, (c) 7.

9. Devise an algorithm that yields directly the binary representation of the
square root of a given positive binary number. Use your algorithm to
determine the binary representations of $\sqrt{2}$, $\sqrt{x}$, $\sqrt{5}$, and check your
result by converting the known decimal representations of these numbers.


† It is shown in number theory that every rational fraction can be represented by an
infinite binary fraction that is ultimately periodic.


15.3. Fixed Point Arithmetic 


As mentioned earlier, in most digital computing machines today numbers 
are internally represented in the binary system. Each element of a 
computer can assume only a finite number of (recognizable and repro- 
ducible) states, and each computer has only finitely many elements. It is 
thus clear that only finitely many different numbers can be represented in a 
computer. A number that can be represented exactly in a given computer 
is called a machine number. All other numbers can be represented in the 
computer only, at best, approximately. There are two principal ways in 
which a given number z can be represented. They are known as the 
fixed point and as the floating point representation of z. 

In fixed point operation, the machine numbers are terminating binary
fractions of the form

(15-9)    $\pm \sum_{k=1}^{t} a_{-k} 2^{-k}$

where $t$ is an integer that is either fixed for a given machine ($t = 35$ for the
IBM 7090, $t = 48$ for the CDC 1604) or, in some cases, can be selected by
the user (e.g., on the IBM 1620). The numbers used by the machine can
also be described as the set of numbers $n \cdot 2^{-t}$, where $n$ is an integer,
$|n| < 2^t$. The density of machine numbers is uniform in the interval
$(-1, 1)$ and zero outside this interval.

If $z$ is any real number, we shall denote by $z^*$ one of the (at most two)
terminating binary fractions of the form (15-9) for which $|z - z^*|$ is a
minimum. If the interval

    $R = [-1 + 2^{-t-1}, 1 - 2^{-t-1}]$

is called the range of the machine, then

(15-10)    $|z - z^*| \le 2^{-t-1}$   for all $z \in R$,

that is, any real number within the range of the machine can be
approximately represented by a machine number $z^*$ with an error of at most
$2^{-t-1}$. Any of the (at most two) representations $z^*$ of a number $z$ will
be called a correctly rounded fixed point representation of $z$.

If $a$ and $b$ are two numbers within the range of the machine, the numbers
$a \pm b$ do not necessarily belong to the range. If they do not belong to
the range we say that overflow occurs. The ever-present possibility of
overflow is one of the serious disadvantages of fixed point arithmetic.

The above remark shows very clearly that even if the data of a
computational problem are within range, we can usually not be sure that all
intermediate results fall within the range. It is frequently possible, 
however, to obtain a (possibly very crude) a priori estimate of the size of 
the intermediate results. One then may try to reformulate the problem 
(for instance by introducing new units of measurement) in such a manner 
that the data as well as the intermediate and final results are within range. 
This reformulation is known as scaling. The frequent necessity of scaling 
is a further disadvantage of fixed point arithmetic. 

Let us now consider the accuracy of the arithmetical operations in fixed
point arithmetic. If $a$ and $b$ are two machine numbers, and if $a + b$ is
within range, then it is clear that $a + b$ is also a machine number. That is,
addition can be performed without error in a fixed point machine, i.e.,
we have

(15-11)    $(a + b)^* = a + b$,   if $a + b \in R$.

The same is not true of multiplication, however. If $a$ and $b$ are two
machine numbers, then $ab$ is, in general, not a machine number.
However $ab \in R$, hence there exists a correctly rounded machine value of $ab$,
and we have

(15-12)    $|(ab)^* - ab| \le 2^{-t-1}$.

The ratio $a/b$ of two machine numbers is, in general, not in the range of the
machine. (The probability for this to be the case is only 50 per cent, a
further disadvantage of fixed point arithmetic.) If it is, we again have

    $|(a/b)^* - a/b| \le 2^{-t-1}$,

independently of the size of $|a/b|$.
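The statements (15-11) and (15-12) can be checked on a small model of a fixed point machine. The following Python sketch is an illustration under stated assumptions: the word length `T = 8` is hypothetical and chosen only to keep the numbers readable, `fix_round` is a made-up name, ties are rounded to the upper neighbour (one of the two admissible choices), and the range check is left to the caller.

```python
from fractions import Fraction
import math

T = 8  # hypothetical number of binary digits after the point


def fix_round(z, t: int = T) -> Fraction:
    """A correctly rounded fixed point value z*: the nearest multiple of 2^-t."""
    return Fraction(math.floor(z * 2 ** t + Fraction(1, 2)), 2 ** t)


a = Fraction(37, 2 ** T)         # two machine numbers n * 2^-t
b = Fraction(101, 2 ** T)
half_unit = Fraction(1, 2 ** (T + 1))

print(fix_round(a + b) == a + b)                     # addition is exact, (15-11)
print(abs(fix_round(a * b) - a * b) <= half_unit)    # product bound (15-12)
print(abs(fix_round(a / b) - a / b) <= half_unit)    # quotient, here within range
```

Exact rationals stand in for the real numbers here, so the comparison is not disturbed by the rounding of the host machine.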

Many machines have built-in subroutines for calculating values of
elementary functions such as $\sin x$ or $\sqrt{x}$. It is clear that even if the
argument $x$ is a machine number, the value of the function is, in general,
not. In the case of the sine function, the best we can hope for in general
is a correctly rounded value of $\sin x$, but even this ideal is seldom attained.
The author knows of a machine where $\sqrt{0}$ was $2^{-t}$.


Problems 


10. Devise an algorithm for forming the sum of two positive machine numbers 
of the form (15-9). 

11. Reformulate the algorithm obtained in problem 9 in such a manner that it
yields, for every nonnegative machine number $x$, a correctly rounded
value of $\sqrt{x}$.

12. Devise an algorithm that yields the correctly rounded product of two
machine numbers $a$ and $b$.


15.4 Floating Point Arithmetic 


In floating point operation, the set of machine numbers consists of 0 and
of the set of all numbers of the form

(15-13)    $z = \pm m 2^p$

where $m$ is a terminating binary fraction,

    $m = \sum_{k=1}^{t} a_{-k} 2^{-k}$,   $\frac{1}{2} \le m < 1$,

normalized by the condition $a_{-1} = 1$, and where $p$ is an integer ranging
between $-P$ and $P$, say. The integers $t$ and $P$ are normally fixed for a
given machine. (On the IBM 7090 computer, $t = 27$, $P = 128$.) The
binary fraction $m$ is called the mantissa and the integer $p$ the exponent of
the floating number (15-13). The machine numbers now cover a much
wider range than in fixed point arithmetic. We again denote, for any real
$z$ and for a given machine, by $z^\circ$ any of the at most two numbers of the
form (15-13) for which $|z - z^\circ|$ is minimized. We now define the
range of the machine to be the interval


    $R = [-(1 - 2^{-t-1})2^P, (1 - 2^{-t-1})2^P]$.
If $z \in R$, $|z| \ge 2^{-P-1}$, we can define $z^\circ$ by

(15-14)    $z^\circ = \operatorname{sign} z \cdot m \cdot 2^p$

where

    $p = [\log_2 |z|] + 1$,   $m = (2^{-p}|z|)^*$.

Here $\log_2$ denotes the logarithm to the base 2, $[x]$ denotes the greatest
integer not exceeding $x$, and the $*$ refers to correct rounding in fixed point
arithmetic. Any of the (at most two) values $z^\circ$ will be called a correctly
rounded floating representation of $z$. If $|z| < 2^{-P-1}$, we set $z^\circ = 0$.

If $|z| \ge 2^{-P-1}$, we evidently have

    $|z - z^\circ| \le 2^p 2^{-t-1}$
    $\phantom{|z - z^\circ|} = 2^{p-1} 2^{-t}$
    $\phantom{|z - z^\circ|} = 2^{[\log_2 |z|]} 2^{-t}$
    $\phantom{|z - z^\circ|} \le 2^{\log_2 |z|} 2^{-t}$

and hence

(15-15)    $|z - z^\circ| \le 2^{-t} |z|$.

This relation shows:


Theorem 15.4 If, for a given floating point machine, $z \in R$, $|z| \ge 2^{-P-1}$,
then the relative error of the correctly rounded floating
representation $z^\circ$ of $z$ is at most $2^{-t}$.


For $z \to 0$ the relative error tends to $\infty$, as in fixed point arithmetic.

Since machines with floating arithmetic can handle very large numbers 
with a small relative error, scaling is, in general, not necessary, and the 
possibility of overflow is minimal. For these reasons, floating arithmetic 
is almost universally used today. The FORTRAN system, for instance, 
employs floating arithmetic. 

Let us examine briefly the accuracy of the basic arithmetic operations in
floating point arithmetic. If $a$ and $b$ are two floating machine numbers,
and if $a + b$ is in the range of the machine, then, unlike the fixed point
case, $a + b$ is not necessarily a machine number. Thus we must also
expect rounding errors in addition. If

    $a = \pm m 2^p$,   $b = \pm n 2^q$,

and if $p \ge q$, we have

    $a + b = \pm m 2^p \pm n 2^q$
    $\phantom{a + b} = (\pm m \pm n 2^{q-p}) 2^p$.

Here $|\pm m \pm n 2^{q-p}| < 2$; thus we have, at worst,

    $(a + b)^\circ = \left( \frac{\pm m \pm n 2^{q-p}}{2} \right)^{*} 2^{p+1}$

and hence

    $|(a + b)^\circ - (a + b)| \le 2^{-t-1} 2^{p+1}$
    $\phantom{|(a + b)^\circ - (a + b)|} = 2^{-t+1} 2^{[\log_2 |a|]}$

and consequently

(15-16)    $|(a + b)^\circ - (a + b)| \le 2^{-t+1} |a|$.

Thus the error of the machine value of a sum (or difference) is at most
$2^{-t+1}$ times the absolute value of the larger summand.

The product $ab$ of two machine numbers $a$ and $b$ is now (contrary to
fixed-point arithmetic) not always a number in the range of the machine.
If the absolute value of the product lies in the interval $[2^{-P}, 2^P]$, we find
by considerations similar to those just given that

(15-17)    $|(ab)^\circ - ab| \le 2^{-t} |ab|$.

Similarly we find for division, if $a/b$ is in the range of the machine,

(15-18)    $\left| \left( \frac{a}{b} \right)^\circ - \frac{a}{b} \right| \le 2^{-t} \left| \frac{a}{b} \right|$.
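The bounds (15-15) to (15-17) can be observed on a small model of floating point rounding. The Python sketch below is an illustration under stated assumptions: the name `float_round` and the mantissa length `T = 10` are hypothetical, the exponent range $-P \le p \le P$ is ignored, ties are rounded upward, and exact rationals play the role of the real numbers.

```python
from fractions import Fraction
import math


def float_round(z: Fraction, t: int) -> Fraction:
    """Correctly rounded floating value with a t-digit binary mantissa, as in (15-14)."""
    if z == 0:
        return Fraction(0)
    sign = 1 if z > 0 else -1
    x = abs(z)
    # Find p = [log2 x] + 1 exactly, i.e. 2^(p-1) <= x < 2^p, starting from a bit-length guess.
    p = x.numerator.bit_length() - x.denominator.bit_length() + 1
    while x >= Fraction(2) ** p:
        p += 1
    while x < Fraction(2) ** (p - 1):
        p -= 1
    m = x / Fraction(2) ** p                                    # 1/2 <= m < 1
    m_star = Fraction(math.floor(m * 2 ** t + Fraction(1, 2)), 2 ** t)
    return sign * m_star * Fraction(2) ** p


T = 10
for z in (Fraction(1, 3), Fraction(22, 7), Fraction(-5, 9)):
    error = abs(float_round(z, T) - z)
    print(z, float_round(z, T), error <= Fraction(1, 2 ** T) * abs(z))
```

If the rounded mantissa comes out as 1, the returned value equals $\frac{1}{2} \cdot 2^{p+1}$, which is again of the form (15-13), so no separate renormalization step is needed in this model.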














Concerning built-in subroutines for elementary functions, the remarks
made above apply also to machines with floating point arithmetic.

Multiple precision arithmetic. As was noted above, both fixed point
numbers and the mantissae of floating numbers are represented in the
machine in the form $N \times u$, where the number $u = 2^{-t}$ may be called the
basic unit of the machine, and where $N$ is an integer, $|N| < 2^t$. For
some machines and/or some problems this approximation is not sufficiently
accurate. In such cases one can increase the accuracy by working instead
with numbers of the form

(15-19)    $N_1 u + N_2 u^2 + \cdots + N_q u^q$,

where the $N_k$ are integers satisfying $0 \le N_k < u^{-1}$, $k = 1, 2, \ldots, q$.
This enlargement of the set of available machine numbers is called working
with $q$-fold precision. Double precision ($q = 2$) is fairly common; on
some machines, its use is encouraged by built-in circuitry to facilitate
arithmetical operations with multiple precision numbers of the form
(15-19). Multiple precision with $q > 2$ is normally restricted to special
experimental programs. From a mathematical point of view, multiple
precision is equivalent to single precision with $t$ replaced by $qt$.


Problems 


13. Prove the inequalities (15-17) and (15-18). 
14. Assume that in the approximate formula of numerical differentiation 


Daflxa) = BSE 


f;, and f_, are known only with errors $ se 16. whereas / is an exact binary 

number. 

(a) If floating operations are used, how big is (at most) the resulting error 
in Daf(Xo)? 

(b) If the function to be differentiated is f(x) = Jo(x), for which value of h 
is the maximum error in the numerical value of Jo(x) due to rounding 
equal to the maximum error due to numerical differentiation ? 

(c) For which value is the sum of the two errors a minimum? 


Suggested Reading 


Wilkinson [1960] gives a careful account of rounding errors in the 
elementary arithmetic operations. For a wealth of detail from the point 
of view of computer technology see Speiser [1961]. 


Research Problem 


Assume that in a fixed point machine it is possible to identify both the 
more and the less significant half of the exact value of the product of two 
machine numbers. Formulate an algorithm for obtaining the correctly 
rounded value of the product of any two complex numbers whose real 
and imaginary parts are machine numbers.


16.1 Introduction and Definitions


The process of replacing a real number $z$ by a machine number $z^*$ or $z^\circ$
is called rounding. The difference $z^* - z$ or $z^\circ - z$ is called rounding
error. Due to rounding, the final result of a numerical computation
usually differs from the theoretically correct result. The difference of the
numerical result and the theoretically correct result is called the
accumulated rounding error. In order to distinguish them from the accumulated
rounding error, the rounding errors committed at each step of a
computation will also be called the local rounding errors. Except in simple
cases, the accumulated rounding error is not merely the sum of the local 
errors. Each local rounding error is propagated through the remaining 
part of the computation, during which process its influence on the 
accumulated rounding error may either be amplified or diminished. 

Different numerical procedures for solving the same theoretical problem 
(such as integrating a differential equation) may show a different behavior 
with respect to the propagation of local rounding error. This varying 
sensitivity with regard to rounding off operations is frequently referred to 
as the numerical stability of the process. In chapter 8 we noted that the 
numerical stability even of one and the same numerical procedure (the 
QD algorithm) can depend strongly on the way the arithmetical operations
are performed. 

In the present chapter we shall study the propagation of local rounding 
error in some simple but typical cases. 


16.2 Finite Differences 


As a first example we consider a case where the accumulated rounding
error is due exclusively to errors in the data of the computation; all
intermediate arithmetic operations are performed exactly. Let
$x_n = a + nh$, $n = 0, 1, 2, \ldots$, let $f$ be a given function, and let $f_n = f(x_n)$.
We consider the effect of replacing the exact values $f_n$ by machine values
$f_n^*$ on the differences of the sequence of values $f_n$.

In fixed point arithmetic, we clearly have

(16-1)    $f_n^* = f_n + \varepsilon_n$,   $n = 0, 1, 2, \ldots$,

where the local rounding errors $\varepsilon_n$ satisfy

(16-2)    $|\varepsilon_n| \le \varepsilon = \frac{1}{2}u$,   $n = 0, 1, 2, \ldots$,

$u$ being the basic unit of the machine (or the table of $f$). According to
(6-27)

    $\Delta^k f_n = \sum_{m=0}^{k} (-1)^{k-m} \binom{k}{m} f_{n+m}$,

    $\Delta^k f_n^* = \sum_{m=0}^{k} (-1)^{k-m} \binom{k}{m} f_{n+m}^*$.

Hence, in view of (16-1), if we denote by $r_n^{(k)}$ the accumulated rounding
error in the $k$th difference,

    $r_n^{(k)} = \Delta^k f_n^* - \Delta^k f_n = \Delta^k (f_n^* - f_n)$
    $\phantom{r_n^{(k)}} = \sum_{m=0}^{k} (-1)^{k-m} \binom{k}{m} \varepsilon_{n+m}$.

It follows by (16-2) that

    $|r_n^{(k)}| \le \sum_{m=0}^{k} \binom{k}{m} |\varepsilon_{n+m}| \le \varepsilon \sum_{m=0}^{k} \binom{k}{m}$.

In view of

    $\sum_{m=0}^{k} \binom{k}{m} = (1 + 1)^k = 2^k$,

we obtain for $r_n^{(k)}$ the bound

(16-3)    $|r_n^{(k)}| \le 2^k \varepsilon$.

This bound cannot be improved in general; equality holds whenever

    $\varepsilon_{n+m} = (-1)^m \varepsilon$.

EXAMPLE

1. We consider an excerpt of a table of sines, tabulated with a step
$h = 0.01$ to six decimal places. In view of

    $\Delta^k y_n = h^k y^{(k)}(\xi)$,   $x_n < \xi < x_{n+k}$

(see problem 11, chapter 11), the exact differences satisfy

    $|\Delta^k y_n| \le 10^{-2k}$,   $k = 0, 1, 2, \ldots$

and are zero from $\Delta^4$ on to the number of digits given. The following
table shows the numerical differences $\Delta^k y_n^*$:


Table 16.1
(differences in units of the sixth decimal place, each entry listed in the row of its first argument $x_n$)

    x_n    y_n* = (sin x_n)*   Δy_n*   Δ²y_n*   Δ³y_n*   Δ⁴y_n*   Δ⁵y_n*
    0.25   0.247 404           9677    -27       2       -6       11
    0.26   0.257 081           9650    -25      -4        5       -7
    0.27   0.266 731           9625    -29       1       -2        1
    0.28   0.276 356           9596    -28      -1       -1        2
    0.29   0.285 952           9568    -29      -2        1        0
    0.30   0.295 520           9539    -31      -1        1       -2
    0.31   0.305 059           9508    -32       0       -1
    0.32   0.314 567           9476    -32      -1
    0.33   0.324 043           9444    -33
    0.34   0.333 487           9411
    0.35   0.342 898


The example shows that the local rounding errors $\varepsilon_n$, although hardly
noticeable in the function values $f_n$ themselves, show up very strongly in
the higher differences. The propagation of rounding error is real.
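The difference columns of the sine excerpt, and the worst case of the bound (16-3), can be recomputed with a few lines of Python. This is a sketch for illustration: the function names are hypothetical, the eleven function values are the rounded sines of the table in integer units of $10^{-6}$, and the alternating error pattern is expressed in units of $\varepsilon$.

```python
from math import comb

# Rounded sines of the table, in integer units of 10^-6 (x = 0.25, 0.26, ..., 0.35).
Y_STAR = [247404, 257081, 266731, 276356, 285952, 295520,
          305059, 314567, 324043, 333487, 342898]


def difference_table(values, max_order):
    """Return the columns [values, first differences, second differences, ...]."""
    table = [list(values)]
    for _ in range(max_order):
        prev = table[-1]
        table.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return table


def kth_difference(values, n, k):
    """k-th forward difference at index n from the binomial formula (6-27)."""
    return sum((-1) ** (k - m) * comb(k, m) * values[n + m] for m in range(k + 1))


table = difference_table(Y_STAR, 5)
for order, column in enumerate(table):
    print(order, column)

# Worst case of (16-3): local errors of alternating sign, in units of eps.
K = 5
alternating = [(-1) ** m for m in range(K + 1)]
worst = kth_difference(alternating, 0, K)
print(abs(worst), 2 ** K)
```

Since the exact fourth and fifth differences vanish to the digits given, the entries in the last two columns are almost entirely accumulated rounding error; with $\varepsilon = 0.5$ units they stay below the bounds 8 and 16.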
Problems

1. A tablemaker wishes to add, in a table of $f(x) = J_0(x)$ giving eight decimal
digits after the decimal point, values of $\delta^2 f_n$ and $\delta^4 f_n$ that differ from the
exact values by at most $0.6 \cdot 10^{-8}$. To what accuracy does $J_0(x)$ have to
be calculated? Does the result depend on the step $h$ or on the function $f$?


2. The impossibility of making a correctly rounded table. A recurrent
nightmare of tablemakers is the following situation. Assume a table of a
function $f$ is to be prepared, giving six digits after the decimal point. For
a certain value of $x_n$, a very accurate computation yields

    $f(x_n) = 0.123456499996$,

with an uncertainty of $7 \cdot 10^{-12}$. Should this value be rounded up or
down? A classical result in number theory states that if $\alpha$ is an irrational
number (such as $\sqrt{2}$ or $e$), the numbers

    $\alpha n - [\alpha n]$,   $n = 0, 1, 2, \ldots$

come arbitrarily close to every number in the interval $(0, 1)$. Show that
this result implies the following: Given any two positive integers $N$ and $M$,
and any finite procedure for calculating square roots, there always exists
an interval such that it is impossible to construct a correctly rounded
table of the function $f(x) = \sqrt{2}\,x$ with the step $10^{-M}$ in that interval,
giving $N$ digits after the decimal point.


16.3 Statistical Approach 


Example 1 above shows that the propagated rounding error in a table
of differences can be considerable. It also shows that the errors in the
differences probably rarely have the maximum values given by (16-3)
(8 and 16 for the last two difference columns). This is due to the fact that
consecutive local rounding errors $\varepsilon_n$ only rarely have the maximum value
permitted by (16-2) and, in addition, occur with strictly alternating signs
such as $\varepsilon, -\varepsilon, \varepsilon, -\varepsilon, \ldots$. Such an occurrence would contradict the
intuitive notion that the local rounding errors are, somehow, distributed
in a random fashion.

It is plain that, on a given machine and for a given problem, the local 
rounding errors are not, in fact, random variables. If the same problem 
is run on the same machine a number of times, there will result always the 
same local rounding errors, and therefore also the same accumulated 
error. We may, however, adopt a stochastic model of the propagation of 
rounding error, where the local errors are treated as if they were random 
variables. This stochastic model has been applied in the literature to a 
number of different numerical problems and has produced results that are 
in complete agreement with experimentally observed results in several 
important cases. 

The most natural assumption concerning the local rounding error in the
stochastic model seems to be the following: We assume that the local
rounding errors are uniformly distributed between their extreme values
$-\varepsilon$ and $\varepsilon$. The probability density $p(x)$, defined as $(\Delta x)^{-1}$ times the
probability that the error lies between $x$ and $x + \Delta x$, of this random
distribution is given by

(16-4)    $p(x) = \begin{cases} 0, & x < -\varepsilon, \\ c, & -\varepsilon \le x \le \varepsilon, \\ 0, & x > \varepsilon. \end{cases}$

The constant $c$ is determined by the condition that the sum of all
probabilities, that is, the integral

    $\int_{-\infty}^{\infty} p(x)\,dx$,

equals 1. This yields the value

    $c = \frac{1}{2\varepsilon}$

in (16-4) (see Fig. 16.3a).

[Figure 16.3: (a) the distribution (16-4); (b) normal probability distribution.]


For theoretical purposes it is more convenient to assume that the local
rounding errors are normally distributed. The normal distribution is
defined by the probability density

(16-5)    $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}$.

Here $\sigma$ is a parameter, called the standard deviation of the distribution, that
measures the spread of the distribution. For small values of $\sigma$ the
distribution is narrowly concentrated around $x = 0$, with a sharp peak at
$x = 0$. For large values of $\sigma$ the distribution is flattened out (see Fig.
16.3b).

The absolute value of a normally distributed random variable exceeds $\sigma$
in only 31.7 per cent of all cases and $2.576\sigma$ in only 1 per cent of all cases.

Both distributions considered above have mean value 0. For a random
variable $\xi$ with arbitrary probability density $p(x)$, the mean or expected
value is defined by

(16-6)    $E(\xi) = \int_{-\infty}^{\infty} x p(x)\,dx$.

The standard deviation is defined as the square root of the variance of $\xi$,
which in turn is defined as the mean of the square of the deviation of $\xi$
from its mean. Thus, if $\mu = E(\xi)$,

(16-7)    $\operatorname{var}(\xi) = \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\,dx$;   $\operatorname{s.d.}(\xi) = \sqrt{\operatorname{var}(\xi)}$.


EXAMPLES

2. The variance of the distribution defined by (16-4) is

    $\int_{-\varepsilon}^{\varepsilon} x^2 \frac{1}{2\varepsilon}\,dx = \frac{\varepsilon^2}{3}$.

3. The variance of the normal distribution is

    $\frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x^2 e^{-x^2/(2\sigma^2)}\,dx = \frac{2\sigma^2}{\sqrt{\pi}} \int_0^{\infty} t^{1/2} e^{-t}\,dt = \sigma^2$.

Thus, if we wish the distributions (16-4) and (16-5) to have the same
standard deviation, we must choose

    $\sigma = \frac{\varepsilon}{\sqrt{3}}$.

If $\xi_1, \xi_2, \ldots, \xi_m$ are random variables with means $\mu_1, \mu_2, \ldots, \mu_m$, and
if $a_1, a_2, \ldots, a_m$ are arbitrary constants, the quantity

(16-8)    $\xi = a_1 \xi_1 + a_2 \xi_2 + \cdots + a_m \xi_m$

is again a random variable, and its mean is given by

(16-9)    $E(\xi) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_m \mu_m$.

The importance of the normal probability distribution (16-5) is based on
the following fact: If the variables $\xi_1, \xi_2, \ldots, \xi_m$ are independent (i.e., if
the value of any $\xi_i$ has no bearing on the value of any other $\xi_j$), and if
they are normally distributed with standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_m$,
then the variable defined by (16-8) is also normally distributed, and its
standard deviation is given by

(16-10)    $\operatorname{s.d.}(\xi) = \sqrt{a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_m^2 \sigma_m^2}$.

In a limiting sense, the above statement is also true if the random variables
$\xi_i$ themselves are not normally distributed.

Let us now apply the above results to the problem of forming differences
considered in §16.2. Assuming that the local rounding errors $\varepsilon_n$ are
independent, normally distributed random variables with mean zero and
standard deviation $\sigma$, we find that $r_n^{(k)}$ is a normally distributed variable
with variance

    $\operatorname{var}(r_n^{(k)}) = \sigma^2 \sum_{m=0}^{k} \binom{k}{m}^2$.

Comparing coefficients of $x^k$ in the identity

    $\left[ \binom{k}{0} + \binom{k}{1} x + \cdots + \binom{k}{k} x^k \right]^2 = \binom{2k}{0} + \binom{2k}{1} x + \cdots + \binom{2k}{2k} x^{2k}$

we find that

    $\sum_{m=0}^{k} \binom{k}{m}^2 = \binom{2k}{k} = \frac{(2k)!}{(k!)^2}$.

We thus obtain

    $\operatorname{var}(r_n^{(k)}) = \frac{(2k)!}{(k!)^2} \sigma^2$,
    $\operatorname{s.d.}(r_n^{(k)}) = \frac{\sqrt{(2k)!}}{k!} \sigma$.

In order to obtain an intuitive notion of the size of the coefficient
appearing here, we approximate the factorials by Stirling's formula
(see §9.3):

    $\frac{(2k)!}{(k!)^2} \approx \frac{\sqrt{4\pi k}\,(2k/e)^{2k}}{2\pi k\,(k/e)^{2k}} = \frac{2^{2k}}{\sqrt{\pi k}}$.

This yields

(16-11)    $\operatorname{s.d.}(r_n^{(k)}) \approx \frac{2^k \sigma}{\sqrt[4]{\pi k}} = \frac{2^k \varepsilon}{\sqrt{3}\,\sqrt[4]{\pi k}}$

in the case of the rectangular distribution (16-4). This relation shows
that the ratio of the standard deviation and the theoretical maximum
$2^k \varepsilon$ is not as small as might be expected.
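The size of the coefficient can be made concrete by evaluating both the exact expression $\sqrt{(2k)!}/k! \cdot \sigma$ and its Stirling approximation (16-11). The Python sketch below is an illustration only; the function names are hypothetical, and $\varepsilon = 0.5$ (half a unit of the last digit) with $\sigma = \varepsilon/\sqrt{3}$ is assumed, as for the rectangular distribution.

```python
import math


def sd_exact(k: int, eps: float = 0.5) -> float:
    """Standard deviation sqrt((2k)!)/k! * sigma with sigma = eps / sqrt(3)."""
    sigma = eps / math.sqrt(3)
    return math.sqrt(math.comb(2 * k, k)) * sigma


def sd_stirling(k: int, eps: float = 0.5) -> float:
    """Stirling approximation (16-11): 2^k * sigma / (pi * k)^(1/4)."""
    sigma = eps / math.sqrt(3)
    return 2 ** k * sigma / (math.pi * k) ** 0.25


for k in range(3, 7):
    worst = 2 ** k * 0.5          # theoretical maximum 2^k * eps from (16-3)
    print(k, round(sd_exact(k), 3), round(sd_stirling(k), 3), worst)
```

The Stirling value lies slightly above the exact one, and both remain a sizeable fraction of the theoretical maximum $2^k \varepsilon$.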





EXAMPLE

4. Twenty-four consecutive values of the exponential function given in
Comrie's table (Comrie [1961]) gave the following experimental values of
the standard deviations of the differences $\Delta^k y_n$ (in units of the least
significant digit):


Table 16.3

    k                          3      4      5      6
    experimental s.d.          1.31   2.28   4.22   7.64
    theoretical s.d. (16-11)   1.29   2.42   4.57   8.78


Problem


3. For $m$ given machine numbers $a_1, a_2, \ldots, a_m$ the sums

    $s_k = a_1^2 + a_2^2 + \cdots + a_k^2$,
    $S_k = s_1 + s_2 + \cdots + s_k$   $(k = 1, 2, \ldots, m)$

are formed by means of the recurrence relations

    $s_0 = 0$,  $s_k = s_{k-1} + a_k^2$;   $S_0 = 0$,  $S_k = S_{k-1} + s_k$

$(k = 1, 2, \ldots, m)$. What is the resulting standard deviation in $s_m$ and $S_m$,
if the rounding errors at each step are considered independent random
variables with the distribution (16-4)? What are the theoretically possible
maximum errors?


16.4 A Scheme for the Study of Propagated Error 


In the example considered above, the accumulated error was due 
exclusively to the rounding of the initial data of the computation. All 
intermediate computations were performed exactly. We shall now discuss 
a scheme of fairly wide applicability which permits us to take into account 
rounding in the intermediate results. 

Many numerical algorithms consist in generating a sequence of numbers
$q_0, q_1, q_2, \ldots$ that are defined by recurrence relations of the form

(16-12)  $$q_n = Q_n(q_0, q_1, q_2, \ldots, q_{n-1}), \qquad n = 1, 2, \ldots.$$

In actual numerical computation not the numbers $q_n$ are generated, but
certain machine numbers $\tilde q_n$ (not necessarily the correctly rounded values
$q_n^*$!), which satisfy the recurrence relations

(16-13)  $$\tilde q_n = \tilde Q_n(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1}), \qquad n = 1, 2, \ldots.$$

Here the symbols $\tilde Q_n$ denote rounded and otherwise approximated values
of the functions $Q_n$. Instead of working with relation (16-13), which is
difficult mathematically, we write (16-13) in the form

(16-14)  $$\tilde q_n = Q_n(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1}) + \varepsilon_n.$$

We consider this relation as the definition of the local rounding error $\varepsilon_n$.
Thus,

$$\varepsilon_n = \tilde Q_n(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1}) - Q_n(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1}).$$

By analyzing the computational process used to evaluate $Q_n$, some statement
can usually be made about the size and distribution of the $\varepsilon_n$.

Let now the accumulated rounding errors $r_n$ be defined by

(16-15)  $$r_n = \tilde q_n - q_n, \qquad n = 0, 1, 2, \ldots.$$
By subtracting (16-12) from (16-14) we find

$$r_n = Q_n(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1}) - Q_n(q_0, q_1, \ldots, q_{n-1}) + \varepsilon_n$$

or, using the mean value theorem,

(16-16)  $$r_n = Q_n^{(0)} r_0 + Q_n^{(1)} r_1 + \cdots + Q_n^{(n-1)} r_{n-1} + \varepsilon_n,$$

where $Q_n^{(k)}$ denotes a partial derivative,

$$Q_n^{(k)} = \frac{\partial Q_n}{\partial q_k}, \qquad k = 0, 1, \ldots, n-1,$$

taken at some point between $(q_0, q_1, \ldots, q_{n-1})$ and $(\tilde q_0, \tilde q_1, \ldots, \tilde q_{n-1})$.

If the functions $Q_n$ are linear in the $q_k$, the partial derivatives $Q_n^{(k)}$ do not
depend on the previous rounding errors $r_0, r_1, \ldots, r_{n-1}$, and (16-16) then
represents a linear difference equation for the quantities $r_n$. If the $Q_n^{(k)}$
are constants, and if $Q_n^{(k)} = 0$ for $k < n - m$, then (16-16) is a difference
equation of order $m$ with constant coefficients and can be solved by the
methods of §6.7 and §6.8. There results a solution of the form

(16-17)  $$r_n = \sum_{m=0}^{n} d_{nm}\,\varepsilon_m,$$


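where the coefficients $d_{nm}$ themselves satisfy a certain difference equation.

A solution of the form (16-17) must also exist in the general case of a
recurrence relation of the form (16-12) with linear functions $Q_n$, since all
$\varepsilon_n$ enter into (16-16) only linearly. In order to determine the coefficients
$d_{nm}$ we substitute (16-17) into (16-16). There results

$$
\begin{aligned}
r_n ={}& Q_n^{(0)} d_{00}\varepsilon_0 \\
&+ Q_n^{(1)} (d_{10}\varepsilon_0 + d_{11}\varepsilon_1) \\
&+ Q_n^{(2)} (d_{20}\varepsilon_0 + d_{21}\varepsilon_1 + d_{22}\varepsilon_2) \\
&+ \cdots \\
&+ Q_n^{(n-1)} (d_{n-1,0}\varepsilon_0 + d_{n-1,1}\varepsilon_1 + \cdots + d_{n-1,n-1}\varepsilon_{n-1}) \\
&+ \varepsilon_n.
\end{aligned}
$$

The expression on the right must be identical with (16-17) for all possible
values of $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_n$. It follows that the coefficients of corresponding
$\varepsilon_m$ must be equal. Comparing the coefficients of $\varepsilon_n$ we immediately find

(16-18)  $$d_{nn} = 1, \qquad n = 0, 1, 2, \ldots,$$

and hence, comparing the coefficients of $\varepsilon_{n-1}, \varepsilon_{n-2}, \ldots, \varepsilon_0$,

(16-19)
$$
\begin{aligned}
d_{n,n-1} &= Q_n^{(n-1)}, \\
d_{n,n-2} &= Q_n^{(n-1)} d_{n-1,n-2} + Q_n^{(n-2)}, \\
&\;\;\vdots \\
d_{n,0} &= Q_n^{(n-1)} d_{n-1,0} + Q_n^{(n-2)} d_{n-2,0} + \cdots + Q_n^{(1)} d_{1,0} + Q_n^{(0)}.
\end{aligned}
$$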
These relations have the form of difference equations for the $d_{nm}$ with
respect to the first subscript and can be solved in special cases.
We summarize the above result as follows:


Theorem 16.4 If (16-12) is a linear algorithm, i.e., if

(16-20)  $$q_n = Q_n^{(0)} q_0 + Q_n^{(1)} q_1 + \cdots + Q_n^{(n-1)} q_{n-1} + p_n$$

for certain constants $Q_n^{(k)}$ and $p_n$ $(n = 0, 1, 2, \ldots;\ k = 0, 1, \ldots,
n - 1)$, then the dependence of the accumulated rounding errors on
the local rounding errors is given by (16-17), where the coefficients
$d_{nm}$ are determined by (16-19).


The coefficients $d_{nm}$ may be called influence coefficients, because they
indicate the influence of the local errors on the accumulated error.
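
The recurrence for the influence coefficients translates directly into a short program. The sketch below is an illustration under stated assumptions, not part of the original text: the callback `Q(n, k)`, standing for the constant $Q_n^{(k)}$, and both function names are hypothetical. It computes $d_{nm} = \sum_{k=m}^{n-1} Q_n^{(k)} d_{km}$ with $d_{mm} = 1$ and checks the representation $r_n = \sum_m d_{nm}\varepsilon_m$ against a direct evaluation of the error recurrence.

```python
import random

def influence_coefficients(Q, N):
    """Return d[n][m] for 0 <= m <= n <= N, with d[n][n] = 1."""
    d = [[0.0] * (N + 1) for _ in range(N + 1)]
    for n in range(N + 1):
        d[n][n] = 1.0
        for m in range(n):
            # coefficient of eps_m obtained by comparing both sides
            d[n][m] = sum(Q(n, k) * d[k][m] for k in range(m, n))
    return d

def accumulated_errors(Q, eps):
    """Evaluate r_n = sum_k Q(n, k) * r_k + eps_n directly."""
    r = []
    for n, e in enumerate(eps):
        r.append(sum(Q(n, k) * r[k] for k in range(n)) + e)
    return r

Q = lambda n, k: 1.0 / (n + k + 1)   # arbitrary illustrative coefficients
rng = random.Random(1)
eps = [rng.uniform(-1.0, 1.0) for _ in range(8)]
d = influence_coefficients(Q, 7)
r = accumulated_errors(Q, eps)
gap = max(abs(r[n] - sum(d[n][m] * eps[m] for m in range(n + 1)))
          for n in range(8))
print("largest discrepancy:", gap)
```

The discrepancy is at the level of floating point rounding, since both computations evaluate the same linear expression in the $\varepsilon_m$.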

Once the influence coefficients have been calculated, the relation 
(16-17) may be used to make either a deterministic or a probabilistic 
statement about the accumulated rounding error. Assuming 


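(16-21)  $$|\varepsilon_m| \leq \varepsilon, \qquad m = 0, 1, 2, \ldots$$

we find

(16-22)  $$|r_n| \leq \varepsilon \sum_{m=0}^{n} |d_{nm}|.$$

The bound on the right is attained for $\varepsilon_m = \varepsilon\,\operatorname{sign} d_{nm}$ and thus cannot be
improved. Assuming that the local errors are independent random
variables with

(16-23)  $$E(\varepsilon_m) = \mu, \qquad \operatorname{var}\,(\varepsilon_m) = \sigma^2, \qquad m = 0, 1, 2, \ldots$$

we find for the random variables $r_n$, using (16-9) and (16-10),

(16-24)  $$E(r_n) = \mu \sum_{m=0}^{n} d_{nm}, \qquad \operatorname{var}\,(r_n) = \sigma^2 \sum_{m=0}^{n} d_{nm}^2.$$

In the case where the expected values and variances of the local errors
depend on $m$ (which is the appropriate assumption for floating point
arithmetic) the assumptions (16-23) are to be replaced by

(16-25)  $$E(\varepsilon_m) = \mu_m, \qquad \operatorname{var}\,(\varepsilon_m) = \sigma_m^2, \qquad m = 0, 1, 2, \ldots$$

yielding

$$E(r_n) = \sum_{m=0}^{n} \mu_m d_{nm}, \qquad \operatorname{var}\,(r_n) = \sum_{m=0}^{n} \sigma_m^2 d_{nm}^2.$$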
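
Once a row $d_{n0}, d_{n1}, \ldots, d_{nn}$ of influence coefficients is available, the deterministic bound and the two moments are simple sums. The helper names below are illustrative assumptions; the sketch covers both the case of constant $\mu$, $\sigma^2$ and the case where they depend on $m$.

```python
def error_statistics(d_row, eps, mu, sigma2):
    """Bound eps*sum|d|, mean mu*sum d, variance sigma2*sum d^2."""
    bound = eps * sum(abs(d) for d in d_row)
    mean = mu * sum(d_row)
    variance = sigma2 * sum(d * d for d in d_row)
    return bound, mean, variance

def error_moments_variable(d_row, mus, sigma2s):
    """Moments when E(eps_m) = mu_m and var(eps_m) = sigma_m^2 depend on m."""
    mean = sum(mu * d for mu, d in zip(mus, d_row))
    variance = sum(s2 * d * d for s2, d in zip(sigma2s, d_row))
    return mean, variance

row = [0.25, 0.5, 1.0]   # an example row d_{2,0}, d_{2,1}, d_{2,2}
print(error_statistics(row, eps=1.0, mu=0.0, sigma2=1.0))
print(error_moments_variable(row, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
```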
In a qualitative sense, the above results remain true even when the
functions $Q_n$ are not linear. The derivatives $Q_n^{(k)}$ in (16-16) are then
evaluated at $(q_0, q_1, \ldots, q_{n-1})$, and terms of higher order in Taylor’s
expansion are neglected. This can frequently be justified if an a priori
bound for the accumulated errors $r_n$ is known.


16.5 Applications 


We now shall apply the results of §16.4 to a number of special algorithms. 
Most of these have already been discussed in preceding chapters. For the 
sake of simplicity we shall assume that fixed point arithmetic is used. The 
hypotheses (16-21) and (16-23) are then appropriate. In many cases the 
results remain qualitatively true for floating point arithmetic. 

(i) Evaluation of a sum. We begin by considering the extremely simple
example of evaluating numerically the sum

$$S = \sum_{n=1}^{N} a_n,$$

where $a_1, a_2, \ldots, a_N$ are given real numbers. This can be put into
algorithmic form by setting

$$q_0 = 0, \qquad q_n = q_{n-1} + a_n, \qquad n = 1, 2, \ldots, N.$$

Clearly, $q_N = S$. We thus have

$$Q_n(q_0, q_1, \ldots, q_{n-1}) = q_{n-1} + a_n.$$

In actual computation, in place of the exact numbers $q_n$ numerical
values $\tilde q_n$ are generated according to the scheme

(16-28)  $$\tilde q_0 = 0, \qquad \tilde q_n = \tilde q_{n-1} + a_n^*.$$

(In floating arithmetic, the second of the relations (16-28) would have to
be replaced by $\tilde q_n = (\tilde q_{n-1} + a_n^*)^*$.) According to our scheme, we
replace the second relation (16-28) by

$$\tilde q_n = \tilde q_{n-1} + a_n + \varepsilon_n,$$

showing that the local error $\varepsilon_n$ is given by

$$\varepsilon_n = a_n^* - a_n,$$

and thus, in fixed point binary arithmetic, satisfies (16-21) with $\varepsilon = 2^{-t-1}$,
or (16-23) with

(16-29)  $$\mu = 0, \qquad \sigma^2 = \tfrac{1}{12}\,2^{-2t}.$$
In view of

$$Q_n^{(k)} = 0, \quad k = 0, 1, 2, \ldots, n - 2, \qquad Q_n^{(n-1)} = 1,$$

the relations (16-19) reduce to

$$d_{n,n-1} = 1, \qquad d_{n,k} = d_{n-1,k}, \qquad n > k,$$

and thus yield $d_{nk} = 1$, $n \geq k$. Relation (16-17) thus shows that

(16-30)  $$r_n = \sum_{m=1}^{n} \varepsilon_m.$$

(In the present simple example this result could have been obtained
directly, see problem 3.) We thus find from (16-22) and (16-24)

(16-31)  $$|r_n| \leq n\varepsilon, \qquad \operatorname{s.d.}\,(r_n) = \sqrt{n}\,\sigma.$$

The result shows that while the theoretically greatest possible error in a
sum of $n$ terms grows like $n$, the standard deviation grows only like $\sqrt{n}$.
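
A small Monte Carlo experiment illustrates the contrast between the two growth rates. This is a sketch under the statistical hypothesis only: the local errors are drawn directly from the rectangular distribution on $[-\varepsilon, \varepsilon]$ rather than produced by actual rounding, and the values of `t`, `n` and the number of trials are arbitrary choices.

```python
import math
import random

def simulate_sum_errors(n, eps, trials, seed=0):
    """Draw r_n = eps_1 + ... + eps_n with eps_m uniform on [-eps, eps]."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-eps, eps) for _ in range(n))
            for _ in range(trials)]

t = 10                          # assumed number of binary digits
eps = 2.0 ** (-t - 1)
sigma = eps / math.sqrt(3.0)    # same as sigma^2 = 2^(-2t) / 12
n = 400
errors = simulate_sum_errors(n, eps, trials=2000)
sample_sd = math.sqrt(sum(e * e for e in errors) / len(errors))

print("largest |r_n| observed:", max(abs(e) for e in errors))
print("deterministic bound   :", n * eps)
print("sample s.d.           :", sample_sd)
print("sqrt(n) * sigma       :", math.sqrt(n) * sigma)
```

The observed errors stay far below the bound $n\varepsilon$, while the sample standard deviation is close to $\sqrt{n}\,\sigma$.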

(ii) Iteration. We next consider the algorithm of solving the equation

$$x = f(x)$$

by determining the limit of the sequence $\{x_n\}$ defined by algorithm 4.1:

$$x_n = f(x_{n-1}).$$

Here the functions $Q_n$ in (16-12) are given by

$$Q_n(q_0, q_1, \ldots, q_{n-1}) = f(q_{n-1}).$$

To simplify matters we shall consider only the idealized situation where

$$f(x) = ax + b$$

and where $a$ and $b$ are constant machine numbers. For convergence
we must assume (see theorem 4.2) that $|a| < 1$. On the machine, the
theoretical recurrence relation

$$q_n = a q_{n-1} + b$$

is replaced by

$$\tilde q_n = (a \tilde q_{n-1})^* + b.$$

Writing the latter relation in the form

$$\tilde q_n = a \tilde q_{n-1} + b + \varepsilon_n,$$

we see that

$$\varepsilon_n = (a \tilde q_{n-1})^* - a \tilde q_{n-1}.$$
Thus, in fixed point arithmetic, (16-21) again holds with $\varepsilon = 2^{-t-1}$, and
(16-23) with the values (16-29). In view of

$$Q_n^{(k)} = 0, \quad k = 0, 1, \ldots, n - 2, \qquad Q_n^{(n-1)} = a,$$

the recurrence relations (16-19) simplify to

$$d_{mm} = 1, \qquad d_{nm} = a\,d_{n-1,m} \qquad (n > m \geq 0)$$

and permit the immediate solution

$$d_{nm} = a^{n-m}, \qquad n = m, m + 1, \ldots.$$

It follows that

$$r_n = \sum_{m=0}^{n} a^{n-m} \varepsilon_m.$$

In view of $|a| < 1$ we thus find

(16-32)  $$|r_n| \leq \varepsilon\,\frac{1 - |a|^{n+1}}{1 - |a|} < \varepsilon\,\frac{1}{1 - |a|}$$

and for the stochastic model

(16-33)  $$\operatorname{var}\,(r_n) = \sigma^2\,\frac{1 - a^{2n+2}}{1 - a^2} < \sigma^2\,\frac{1}{1 - a^2}.$$


The bounds (16-32) and (16-33) are remarkable for the fact that they are 
independent of n, the number of iteration steps. An algorithm with this 
property must be called stable under any reasonable definition of this 
term. Newton’s method in particular, which corresponds to the case 
a = 0 of the above simplified model, enjoys an extreme degree of stability. 
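
The independence of the bound from the number of steps can be observed in an exact simulation of fixed point arithmetic. The sketch below uses Python's `fractions.Fraction`, so that only the explicit rounding of the product introduces an error. The helper `round_fixed`, the word length `t = 12`, and the values of `a`, `b`, and the starting value are assumptions made for the illustration.

```python
import math
from fractions import Fraction

def round_fixed(x, t):
    """Round the exact rational x to the nearest multiple of 2**-t."""
    scale = 2 ** t
    return Fraction(math.floor(x * scale + Fraction(1, 2)), scale)

def iterate(a, b, q0, steps, t=None):
    """Iterate q <- a*q + b; if t is given, round each product to t bits."""
    q = q0
    for _ in range(steps):
        product = a * q
        if t is not None:
            product = round_fixed(product, t)
        q = product + b
    return q

t = 12
eps = Fraction(1, 2 ** (t + 1))
a, b, q0 = Fraction(3, 4), Fraction(1, 8), Fraction(0)   # machine numbers
bound = eps / (1 - abs(a))                               # eps / (1 - |a|)

for steps in (5, 50, 500):
    r = iterate(a, b, q0, steps, t) - iterate(a, b, q0, steps)
    print(steps, float(r), float(bound))
```

However many steps are taken, the accumulated error never exceeds $\varepsilon/(1 - |a|)$, here four times the local bound.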

(iii) Generating $\cos n\varphi$ and $\sin n\varphi$ by recurrence relations. As our next
example, we consider the algorithm for generating the sequences $\{\cos n\varphi\}$
and $\{\sin n\varphi\}$ suggested by example 3, §6.3. (These
sequences are required, e.g., for computing the sums of Fourier series.)
The difference equation

(16-34)  $$t_n - 2\cos\varphi\;t_{n-1} + t_{n-2} = 0$$

has the solutions $t_n = \cos n\varphi$ and $t_n = \sin n\varphi$ and can therefore be used
to generate these functions recursively if the values of $\cos\varphi$ and $\sin\varphi$ are
known. Here the theory of §16.4 applies with

$$Q_n(t_0, t_1, \ldots, t_{n-1}) = 2\cos\varphi\;t_{n-1} - t_{n-2}.$$

In numerical computation, (16-34) is replaced by

(16-35)  $$\tilde t_n = (2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1})^* - \tilde t_{n-2},$$
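where $\widetilde{\cos}\,\varphi$ denotes a machine value of $\cos\varphi$. Writing (16-35) in the
form

$$\tilde t_n = 2\cos\varphi\;\tilde t_{n-1} - \tilde t_{n-2} + \varepsilon_n,$$

we see that

$$\varepsilon_n = (2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1})^* - 2\cos\varphi\;\tilde t_{n-1}.$$

Adding and subtracting $2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1}$, we also have

$$\varepsilon_n = (2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1})^* - 2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1} + 2(\widetilde{\cos}\,\varphi - \cos\varphi)\,\tilde t_{n-1}.$$

Here we see that $\varepsilon_n$ is due to two sources: (a) rounding the product
$2\,\widetilde{\cos}\,\varphi\;\tilde t_{n-1}$; (b) replacing $\cos\varphi$ by $\widetilde{\cos}\,\varphi$. Both errors can be estimated
and permit us to make assumptions such as (16-21) and (16-23).

To determine the $d_{nm}$ we observe that

$$Q_n^{(k)} = 0, \quad k = 0, 1, \ldots, n - 3, \qquad Q_n^{(n-2)} = -1, \qquad Q_n^{(n-1)} = 2\cos\varphi.$$

The relations (16-19) thus yield

(16-36)
$$
\begin{aligned}
d_{mm} &= 1, \\
d_{m+1,m} &= 2\cos\varphi, \\
d_{nm} &= 2\cos\varphi\;d_{n-1,m} - d_{n-2,m}, \qquad n \geq m + 2.
\end{aligned}
$$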


These equations show that, for a fixed value of $m$, the quantities $d_{nm}$
themselves are a solution of the difference equation (16-34), with the
initial values given at $n = m$ and $n = m + 1$. We thus must have

$$d_{nm} = A_m \cos n\varphi + B_m \sin n\varphi, \qquad n = m, m + 1, \ldots,$$

where $A_m$ and $B_m$ are to be determined from the first two relations (16-36).
Some algebra yields (if $0 < \varphi < \pi$)

$$A_m = -\frac{\sin (m - 1)\varphi}{\sin\varphi}, \qquad B_m = \frac{\cos (m - 1)\varphi}{\sin\varphi},$$

and we thus get, after some further simplification,

$$d_{nm} = \frac{\sin (n - m + 1)\varphi}{\sin\varphi}, \qquad n \geq m.$$

Since $\varepsilon_0 = 0$, the general theory thus yields

$$r_n = \sum_{m=1}^{n} \frac{\sin (n - m + 1)\varphi}{\sin\varphi}\,\varepsilon_m.$$
Using the crude estimate $|\sin (n - m + 1)\varphi| \leq 1$, we thus have

$$|r_n| \leq \frac{n\varepsilon}{\sin\varphi}.$$

Using the known formula

$$\sum_{m=1}^{n} (\sin m\varphi)^2 = \frac{n}{2} + \frac{1}{4} - \frac{\sin (2n + 1)\varphi}{4\sin\varphi}$$

we get for the variance the expression

$$\operatorname{var}\,(r_n) = \sigma^2 \sum_{m=1}^{n} \left(\frac{\sin (n - m + 1)\varphi}{\sin\varphi}\right)^2 = \frac{\sigma^2}{(\sin\varphi)^2} \left[\frac{n}{2} + \frac{1}{4} - \frac{\sin (2n + 1)\varphi}{4\sin\varphi}\right].$$

As in the algorithm for finding the sum of $n$ numbers, the standard deviation
of the accumulated error grows only like $\sqrt{n}$, while the rigorous
estimate grows like $n$. This relatively high stability of the recursive
method for generating $\cos n\varphi$ and $\sin n\varphi$ (or, what amounts to the same,
the Chebyshev polynomials) has been observed experimentally (Lanczos
[1955]).
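
Both the closed form of the influence coefficients and the trigonometric sum can be checked numerically. The sketch below is an illustration with hypothetical function names: it generates $d_{nm}$ from the three-term recurrence with starting values $1$ and $2\cos\varphi$, and compares the result with $\sin((n-m+1)\varphi)/\sin\varphi$ and with the bracketed variance expression.

```python
import math

def d_closed(n, m, phi):
    """Closed form sin((n - m + 1) * phi) / sin(phi)."""
    return math.sin((n - m + 1) * phi) / math.sin(phi)

def d_recurrence(N, m, phi):
    """d_{n,m} for n = m, ..., N from d_n = 2 cos(phi) d_{n-1} - d_{n-2}."""
    c = 2.0 * math.cos(phi)
    d = [1.0, c]
    for _ in range(N - m - 1):
        d.append(c * d[-1] - d[-2])
    return d[: N - m + 1]

def variance_factor(n, phi):
    """[n/2 + 1/4 - sin((2n+1) phi) / (4 sin phi)] / sin(phi)^2."""
    s = math.sin(phi)
    return (n / 2 + 0.25 - math.sin((2 * n + 1) * phi) / (4 * s)) / s ** 2

phi, m, N = 0.3, 3, 40
rec = d_recurrence(N, m, phi)
print(max(abs(rec[n - m] - d_closed(n, m, phi)) for n in range(m, N + 1)))

n = 25
direct = sum(d_closed(n, k, phi) ** 2 for k in range(1, n + 1))
print(direct, variance_factor(n, phi))
```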

(iv) Horner’s scheme. We next consider the propagation of error in the
evaluation of a polynomial by Horner’s scheme (algorithm 3.4), assuming
that the coefficients $a_0, a_1, \ldots, a_N$ of the polynomial and the value of the
variable $x$ are machine numbers. The scheme then consists in generating
a finite sequence of numbers $q_0, q_1, \ldots, q_N$ by the formulas

(16-37)  $$q_0 = a_0, \qquad q_n = a_n + x q_{n-1}, \qquad n = 1, 2, \ldots, N;$$

the desired value is $q_N$. The functions $Q_n$ are evidently given by

$$Q_n(q_0, q_1, \ldots, q_{n-1}) = a_n + x q_{n-1}.$$

In numerical work, the second relation (16-37) is replaced by $\tilde q_n = a_n
+ (x \tilde q_{n-1})^*$. If the right-hand side is written as $a_n + x \tilde q_{n-1} + \varepsilon_n$, we see that,
as in (ii) above, the local rounding error equals the rounding error in the
fixed point multiplication $x \tilde q_{n-1}$. Thus the assumptions (16-21) and
(16-23) are again justified with $\varepsilon = 2^{-t-1}$, $\mu = 0$, $\sigma^2 = \tfrac{1}{12}\,2^{-2t}$. We now
have

$$Q_n^{(k)} = 0, \quad k = 0, 1, \ldots, n - 2, \qquad Q_n^{(n-1)} = x,$$

and thus find, as above,

$$d_{nm} = x^{n-m}, \qquad n \geq m \geq 0.$$
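Since $\varepsilon_0 = 0$ we thus have

$$r_n = \sum_{m=1}^{n} x^{n-m} \varepsilon_m$$

and find in the usual way

$$|r_n| \leq \begin{cases} \dfrac{|x|^n - 1}{|x| - 1}\,\varepsilon, & |x| \neq 1, \\[2ex] n\varepsilon, & |x| = 1, \end{cases}$$

and, under the statistical hypothesis,

$$\operatorname{s.d.}\,(r_n) = \begin{cases} \sqrt{\dfrac{x^{2n} - 1}{x^2 - 1}}\;\sigma, & |x| \neq 1, \\[2ex] \sqrt{n}\,\sigma, & |x| = 1. \end{cases}$$

Thus, for the evaluation of high degree polynomials in fixed point arithmetic,
Horner’s scheme is unstable when $|x| > 1$ and stable when $|x| < 1$.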
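
The change of behavior at $|x| = 1$ is easy to see numerically. The following sketch, with illustrative function names, sums the influence coefficients $d_{nm} = x^{n-m}$ $(m = 1, \ldots, n)$ directly and compares the result with the closed geometric-series forms $(|x|^n - 1)/(|x| - 1)$ and $\sqrt{(x^{2n} - 1)/(x^2 - 1)}$.

```python
import math

def horner_factors(x, n):
    """Return (sum |d|, sqrt(sum d^2)) for d = x**(n - m), m = 1..n."""
    d = [x ** (n - m) for m in range(1, n + 1)]
    return sum(abs(v) for v in d), math.sqrt(sum(v * v for v in d))

def horner_factors_closed(x, n):
    """Closed forms of the same two factors."""
    if abs(x) == 1:
        return float(n), math.sqrt(n)
    max_factor = (abs(x) ** n - 1) / (abs(x) - 1)
    sd_factor = math.sqrt((x ** (2 * n) - 1) / (x * x - 1))
    return max_factor, sd_factor

for x in (0.5, 1.0, 2.0):
    for n in (10, 20, 40):
        print(x, n, horner_factors(x, n), horner_factors_closed(x, n))
```

For $x = 0.5$ both factors settle near constants, for $x = 1$ they grow like $n$ and $\sqrt{n}$, and for $x = 2$ they grow like $2^n$.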

(v) Numerical integration of differential equations. We finally consider
the numerical integration of the simple but typical initial value problem

$$y' = Ay, \qquad y(0) = 1,$$

where $A$ is a real constant, $A \neq 0$, (a) by Euler’s method; (b) by the
integration scheme based on the midpoint formula discussed in §14.8.
The approximate values generated by Euler’s method satisfy

$$y_0 = 1, \qquad y_n = y_{n-1} + Ah\,y_{n-1}, \qquad n = 1, 2, \ldots.$$

This is an algorithm of the form (16-12) where $Q_n^{(k)} = 0$ $(k = 0, 1, \ldots,
n - 2)$, $Q_n^{(n-1)} = 1 + Ah$. If the local errors are defined by

$$\tilde y_n = \tilde y_{n-1} + Ah\,\tilde y_{n-1} + \varepsilon_n,$$

we find, by computations similar to the ones carried out earlier, that the
accumulated error $r_n = \tilde y_n - y_n$ can be expressed in the form

$$r_n = \sum_{m=1}^{n} (1 + Ah)^{n-m} \varepsilon_m.$$

We thus obtain

$$|r_n| \leq \frac{(1 + Ah)^n - 1}{hA}\,\varepsilon, \qquad \operatorname{var}\,(r_n) = \frac{(1 + Ah)^{2n} - 1}{(1 + Ah)^2 - 1}\,\sigma^2.$$
Some simplification is possible by observing that for $h \to 0$, $nh = x$ fixed
and positive, $(1 + Ah)^n = e^{Ax} + O(h)$. We thus find

(16-38)
$$|r_n| \leq \frac{\varepsilon}{h} \left[\frac{e^{Ax} - 1}{A} + O(h)\right], \qquad \operatorname{var}\,(r_n) = \frac{\sigma^2}{h} \left[\frac{e^{2Ax} - 1}{2A} + O(h)\right].$$

These relations show that Euler’s method is numerically stable if $A < 0$,
provided that $h$ is sufficiently small. The method is unstable if $A > 0$, but
this instability is not serious, since then the solution itself grows exponentially.
Analogous statements hold for all methods based on Taylor’s
expansion, and also for all Adams’ methods.
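
As a numerical check on the limiting forms, one can sum the influence coefficients $(1 + Ah)^{n-m}$ of Euler’s method directly and compare with $(e^{Ax} - 1)/(Ah)$ and $(e^{2Ax} - 1)/(2Ah)$. The sketch below uses illustrative function names and arbitrary values of $A$, $h$ and $x$.

```python
import math

def euler_factors(A, h, n):
    """Return (sum |d|, sum d^2) for d = (1 + A*h)**(n - m), m = 1..n."""
    g = 1.0 + A * h
    s1 = sum(abs(g) ** (n - m) for m in range(1, n + 1))
    s2 = sum(g ** (2 * (n - m)) for m in range(1, n + 1))
    return s1, s2

def euler_factors_limit(A, h, x):
    """Leading terms (e^{Ax} - 1)/(A h) and (e^{2Ax} - 1)/(2 A h)."""
    return ((math.exp(A * x) - 1.0) / (A * h),
            (math.exp(2.0 * A * x) - 1.0) / (2.0 * A * h))

h = 0.001
for A in (-1.0, 1.0):
    for x in (1.0, 2.0, 4.0):
        n = round(x / h)
        print(A, x, euler_factors(A, h, n), euler_factors_limit(A, h, x))
```

For $A = -1$ the factors approach the constants $1/(|A|h)$ and $1/(2|A|h)$ as $x$ increases, while for $A = 1$ they grow at the same exponential rate as the solution itself.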

If the midpoint rule is used, the values $y_n$ satisfy

$$y_0 = 1, \qquad y_1 = e^{Ah} \ \text{(ideally)}, \qquad y_n - y_{n-2} = 2hA\,y_{n-1}, \qquad n = 2, 3, \ldots.$$

This is of the form (16-12) where

$$Q_n(y_0, y_1, \ldots, y_{n-1}) = 2hA\,y_{n-1} + y_{n-2}.$$

A consideration similar to that under (iii) above shows that the coefficients
$d_{nm}$ satisfy, for a fixed value of $m$, the difference equation

(16-39)  $$d_{nm} - 2hA\,d_{n-1,m} - d_{n-2,m} = 0 \qquad (n \geq m + 2)$$

and the starting conditions

(16-40)  $$d_{mm} = 1, \qquad d_{m+1,m} = 2Ah.$$

The characteristic polynomial $p$ of (16-39) is given by

$$p(z) = z^2 - 2hAz - 1;$$

its two zeros are

$$z_1 = Ah + \sqrt{1 + (Ah)^2} = e^{Ah} + O(h^3), \qquad z_2 = Ah - \sqrt{1 + (Ah)^2} = -e^{-Ah} + O(h^3).$$

The solution $d_{nm}$ of (16-39) satisfying (16-40) is found to be

$$d_{nm} = \frac{z_1^{n-m+1} - z_2^{n-m+1}}{z_1 - z_2}.$$

A somewhat elaborate but elementary computation using the approximations
(see §14.8)

$$z_1^n = e^{Ax} + O(h^2), \qquad z_2^n = (-1)^n e^{-Ax} + O(h^2),$$
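valid for $h \to 0$, $nh = x$ fixed and positive, shows that

(16-41)  $$|r_n| \leq \varepsilon \sum_{m=1}^{n} |d_{nm}| = \frac{\varepsilon}{h} \left[\frac{e^{|A|x} - 1}{2|A|} + O(h)\right]$$

and, if $\operatorname{var}\,(\varepsilon_n) = \sigma^2$,

(16-42)  $$\operatorname{var}\,(r_n) = \frac{\sigma^2}{h} \left[\frac{\sinh 2Ax}{4A} + O(h)\right].$$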


These relations confirm the result of §14.8 that the midpoint formula is 
always unstable, also when A < 0. It produces exponential growth of 
the rounding error also when the exact solution is exponentially decreasing. 
It is clear that such a method cannot be used for the numerical integration 
of a differential equation such as y’ = —y. In a qualitative sense, this 
negative result is true for all methods based on numerical integration where 
the numerical integration is performed over an interval comprising several 
steps. Simpson’s rule, in particular, if applied to the numerical solution 
of differential equations, also suffers from this kind of instability, see 
problem 11. 
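
The instability for $A < 0$ shows up clearly when the influence coefficients of the midpoint scheme are generated from their three-term recurrence. In the sketch below, `midpoint_d` is a hypothetical helper; the values of $A$, $h$, $x$ are arbitrary; and the comparison value $\sinh(2Ax)/(4Ah)$ is the leading term of $\sum d^2$ obtained from the closed form of the coefficients, used here as an assumption of the illustration.

```python
import math

def midpoint_d(A, h, n):
    """Return [d_1, ..., d_n], d_k standing for d_{n,m} with k = n - m + 1.

    Starting values are 1 and 2*A*h; afterwards d_k = 2*A*h*d_{k-1} + d_{k-2}.
    """
    d = [1.0, 2.0 * A * h]
    while len(d) < n:
        d.append(2.0 * A * h * d[-1] + d[-2])
    return d[:n]

A, h, x = -1.0, 0.01, 3.0        # decaying exact solution y = exp(-x)
n = round(x / h)
d = midpoint_d(A, h, n)
variance_factor = sum(v * v for v in d)
leading_term = math.sinh(2.0 * A * x) / (4.0 * A * h)

print("largest |d|     :", max(abs(v) for v in d))
print("sum of d^2      :", variance_factor)
print("sinh(2Ax)/(4Ah) :", leading_term)
```

Although the exact solution has decayed to about $0.05$ at $x = 3$, a single local error is amplified by a factor of roughly $\cosh 3 \approx 10$; for Euler’s method with the same data no influence coefficient exceeds 1 in absolute value.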


Problems 


4. In floating arithmetic it is frequently permissible to replace the statistical
hypotheses (16-23) by

(16-43)  $$E(\varepsilon_m) = \mu q_m, \qquad \operatorname{var}\,(\varepsilon_m) = \sigma^2 q_m^2,$$

where $\mu$ and $\sigma$ are again constants. Discuss the propagation of error in
the computation of the sum

(16-44)  $$q_N = a_1 + a_2 + \cdots + a_N,$$

where $a_n > 0$, $n = 1, 2, \ldots$, by the algorithm described under (i) above, if
floating arithmetic is used. Show that, contrary to the fixed point
arithmetic case, the propagated error depends on the order of the terms
in (16-44), and that both the maximum error and the standard deviation
are minimized if the terms are summed in increasing order.

5. Discuss the propagation of rounding error if the geometric series $\sum_{n=0}^{\infty} a^n$,
where $|a| < 1$, is summed in floating point arithmetic, generating $a^n$ by
the formula $a^n = a(a^{n-1})$. Is the resulting process numerically stable?

6. The Fibonacci numbers (see example 7, §6.3) are generated from their
recurrence relation

$$x_0 = x_1 = 1, \qquad x_n = x_{n-1} + x_{n-2}$$

in floating arithmetic. Assuming that the relations (16-43) hold with
$q_n = x_n$, what is the standard deviation of the numerical value $\tilde x_n$ for large
values of $n$? If the $x_n$ are used to compute the number

$$\frac{1 + \sqrt{5}}{2} = \lim_{n \to \infty} \frac{x_{n+1}}{x_n},$$

is this a stable process?

7. (Generalization of problem 6.) Discuss the propagation of rounding
error in Bernoulli’s method for determining a single dominant zero of a
polynomial whose coefficients are exact machine numbers. Discuss both
floating and fixed point arithmetic.

8. Discuss the propagation of rounding error in algorithm 5.6 for extracting
a quadratic factor, assuming fixed point arithmetic.

9. Let, for non-integral $s > 0$, the binomial coefficients $q_n = \binom{s}{n}$ be generated
by the recurrence relation

$$q_0 = 1, \qquad q_n = \frac{s - n + 1}{n}\,q_{n-1}, \qquad n = 1, 2, \ldots.$$

Assuming fixed point arithmetic, show that

$$\operatorname{var}\,(r_n) = \sigma^2 S_n(s),$$

where

$$S_n(s) = \left[\binom{s}{n}\right]^2 \sum_{m=1}^{n} \left[\binom{s}{m}\right]^{-2}.$$

(It can be shown that $S_n(s) = n/(3 + 2s) + O(1)$ as $n \to \infty$.)

10. Study the propagation of rounding error in algorithm 12.4 (repeated
extrapolation to the limit), if the same error distribution (16-23) is assumed
in the zeroth column and in the relations defining the elements of the
$(m + 1)$st column in terms of those of the $m$th column.

11. Show that if the problem discussed under (v) above is solved by Simpson’s
formula

$$y_{n+1} - y_{n-1} = \frac{h}{3}\,(f_{n+1} + 4f_n + f_{n-1})$$

in fixed point arithmetic, the standard deviation of the accumulated
rounding error behaves for $x$ large like $\cosh(\tfrac{1}{3}Ax)$.


Recommended Reading 


The detailed study of the propagation of rounding error in numerical 
computation is of recent origin. The classical paper is by Rademacher 
[1948]. Very detailed accounts of the propagation of error in a large 
number of methods for solving ordinary differential equations are given 
in the author’s books [1962, 1963]. 
Research Problems


1. Make a study, both experimental and (as far as possible) theoretical, 
of the propagation of error in the Quotient-Difference algorithm. 

2. Discuss the stability of Horner’s scheme in floating point arithmetic. 
3. The hypotheses underlying the statistical theory of propagation of 
rounding error have been criticized as unreliable (Huskey [1949]). Carry 
out some statistical experiments on rounding error propagation and 
compare the results with the theoretical results given above. (See the 
author’s books quoted above for some ideas on how to perform such 
experiments.) 